Challenge: What is the future selling price of a home?


A home is often the largest and most expensive purchase a person makes in his or her lifetime. Ensuring homeowners have a trusted way to monitor this asset is incredibly important. In this competition, students are required to develop a full-fledged approach to make predictions about the future sale prices of homes. A full-fledged approach constist, at least, in the following steps:
  • Descriptive statistics about the data
  • Data cleaning and pre-processing
  • Defining a modeling approach to the problem
  • Build such a statistical model
  • Validate the outcome of the model

Now, should you ask a home buyer to describe their dream house, they probably wouldn't begin with describing features such as the height of the basement ceiling or the proximity to a railroad. As you will see, the dataset we use in this competition proves that many more features influence price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in a small city in the US, this competition challenges you to predict the final price of each home.
In [1]:
#this is just to remove the warnings
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings('ignore', category=DeprecationWarning)
In [2]:
#installing some libraries
!pip install pandas_profiling
!pip install xgboost
Collecting pandas_profiling
  Downloading https://files.pythonhosted.org/packages/a7/7c/84f15ee705793a3cdd43bc65e6166d65d36f743b815ea517b02582989533/pandas_profiling-1.4.1-py2.py3-none-any.whl
Requirement already satisfied: jinja2>=2.8 in /opt/conda/lib/python3.6/site-packages (from pandas_profiling)
Requirement already satisfied: six>=1.9 in /opt/conda/lib/python3.6/site-packages (from pandas_profiling)
Requirement already satisfied: pandas>=0.19 in /opt/conda/lib/python3.6/site-packages (from pandas_profiling)
Requirement already satisfied: matplotlib>=1.4 in /opt/conda/lib/python3.6/site-packages (from pandas_profiling)
Requirement already satisfied: MarkupSafe>=0.23 in /opt/conda/lib/python3.6/site-packages (from jinja2>=2.8->pandas_profiling)
Requirement already satisfied: python-dateutil>=2 in /opt/conda/lib/python3.6/site-packages (from pandas>=0.19->pandas_profiling)
Requirement already satisfied: pytz>=2011k in /opt/conda/lib/python3.6/site-packages (from pandas>=0.19->pandas_profiling)
Requirement already satisfied: numpy>=1.9.0 in /opt/conda/lib/python3.6/site-packages (from pandas>=0.19->pandas_profiling)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.6/site-packages (from matplotlib>=1.4->pandas_profiling)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /opt/conda/lib/python3.6/site-packages (from matplotlib>=1.4->pandas_profiling)
Installing collected packages: pandas-profiling
Successfully installed pandas-profiling-1.4.1
You are using pip version 9.0.1, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting xgboost
  Downloading https://files.pythonhosted.org/packages/4b/c4/57e246bc99e45c048f9805f2773e7369f0d30896d19fa089fa1794c7b246/xgboost-0.71.tar.gz (494kB)
    100% |████████████████████████████████| 501kB 1.2MB/s ta 0:00:01
Requirement already satisfied: numpy in /opt/conda/lib/python3.6/site-packages (from xgboost)
Requirement already satisfied: scipy in /opt/conda/lib/python3.6/site-packages (from xgboost)
Building wheels for collected packages: xgboost
  Running setup.py bdist_wheel for xgboost ... done
  Stored in directory: /root/.cache/pip/wheels/4e/6d/1d/0bc23240225fe411315d8abb5d4521b9ff002493ff77515ccc
Successfully built xgboost
Installing collected packages: xgboost
Successfully installed xgboost-0.71
You are using pip version 9.0.1, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
In [4]:
#import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pandas_profiling
import seaborn as sns


from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor


from sklearn.preprocessing import Imputer,OneHotEncoder,LabelEncoder,StandardScaler
from sklearn.cross_validation import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.decomposition import PCA       

Approach #1

This is going to be a very simple and naive approach of what we already know.
It is also useful to compare it with other later approaches.
We will proceed in the following steps:
  • Descriptive Statistics of the data
  • Data Cleaning and Feature Selection
  • Testing different models and comparing

Descriptive Statistics of the data

We begin by visualising the data to get some more feel of it.
We will import the data first.
In [5]:
def import_data():
    train=pd.read_csv("Challenge Data/train.csv")
    test=pd.read_csv("Challenge Data/test.csv")
    return train,test

train,test=import_data()
Let's get a sense of the data by looking at some rows.
In [6]:
train.head(5)
Out[6]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

In [7]:
test.head(5)
Out[7]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1201 20 RL 71.0 9353 Pave NaN Reg Lvl AllPub ... 0 0 NaN NaN Shed 0 7 2006 Oth Abnorml
1 1202 60 RL 80.0 10400 Pave NaN Reg Lvl AllPub ... 0 0 NaN NaN NaN 0 3 2009 WD Normal
2 1203 50 RM 50.0 6000 Pave NaN Reg Lvl AllPub ... 0 0 NaN NaN NaN 0 5 2009 WD Normal
3 1204 20 RL 75.0 9750 Pave NaN Reg Lvl AllPub ... 0 0 NaN NaN NaN 0 10 2009 WD Normal
4 1205 20 RL 78.0 10140 Pave NaN Reg Lvl AllPub ... 0 0 NaN MnPrv NaN 0 7 2006 WD Normal

5 rows × 80 columns

We will use a very cool trick to visualise the input data taught from here.
Please note that clicking on "Toggle details" doesn't work unless you use an ipynb file.
In [8]:
pandas_profiling.ProfileReport(train)
Out[8]:

Overview

Dataset info

Number of variables 81
Number of observations 1200
Total Missing (%) 5.9%
Total size in memory 759.5 KiB
Average record size in memory 648.1 B

Variables types

Numeric 38
Categorical 43
Boolean 0
Date 0
Text (Unique) 0
Rejected 0
Unsupported 0

Warnings

Variables

1stFlrSF
Numeric

Distinct count 676
Unique (%) 56.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 1157.4
Minimum 334
Maximum 3228
Zeros (%) 0.0%

Quantile statistics

Minimum 334
5-th percentile 672
Q1 882
Median 1087
Q3 1390.2
95-th percentile 1803.4
Maximum 3228
Range 2894
Interquartile range 508.25

Descriptive statistics

Standard deviation 375.24
Coef of variation 0.3242
Kurtosis 1.7089
Mean 1157.4
MAD 296.97
Skewness 0.96693
Sum 1388917
Variance 140800
Memory size 9.5 KiB
Value Count Frequency (%)  
864 18 1.5%
 
1040 13 1.1%
 
912 11 0.9%
 
894 10 0.8%
 
848 9 0.8%
 
672 8 0.7%
 
630 7 0.6%
 
936 7 0.6%
 
816 7 0.6%
 
832 6 0.5%
 
Other values (666) 1104 92.0%
 

Minimum 5 values

Value Count Frequency (%)  
334 1 0.1%
 
372 1 0.1%
 
438 1 0.1%
 
480 1 0.1%
 
483 6 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
2515 1 0.1%
 
2524 1 0.1%
 
2898 1 0.1%
 
3138 1 0.1%
 
3228 1 0.1%
 

2ndFlrSF
Numeric

Distinct count 361
Unique (%) 30.1%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 344.99
Minimum 0
Maximum 2065
Zeros (%) 57.1%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 728
95-th percentile 1141
Maximum 2065
Range 2065
Interquartile range 728

Descriptive statistics

Standard deviation 437.04
Coef of variation 1.2668
Kurtosis -0.48487
Mean 344.99
MAD 396.1
Skewness 0.83545
Sum 413992
Variance 191000
Memory size 9.5 KiB
Value Count Frequency (%)  
0 685 57.1%
 
728 8 0.7%
 
504 8 0.7%
 
720 7 0.6%
 
546 7 0.6%
 
672 6 0.5%
 
689 5 0.4%
 
840 5 0.4%
 
780 5 0.4%
 
756 5 0.4%
 
Other values (351) 459 38.2%
 

Minimum 5 values

Value Count Frequency (%)  
0 685 57.1%
 
110 1 0.1%
 
167 1 0.1%
 
213 1 0.1%
 
220 1 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
1589 1 0.1%
 
1796 1 0.1%
 
1818 1 0.1%
 
1872 1 0.1%
 
2065 1 0.1%
 

3SsnPorch
Numeric

Distinct count 18
Unique (%) 1.5%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 3.6533
Minimum 0
Maximum 508
Zeros (%) 98.2%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 0
Maximum 508
Range 508
Interquartile range 0

Descriptive statistics

Standard deviation 29.991
Coef of variation 8.2092
Kurtosis 122.08
Mean 3.6533
MAD 7.1727
Skewness 10.122
Sum 4384
Variance 899.47
Memory size 9.5 KiB
Value Count Frequency (%)  
0 1178 98.2%
 
168 3 0.2%
 
216 2 0.2%
 
180 2 0.2%
 
144 2 0.2%
 
320 1 0.1%
 
245 1 0.1%
 
238 1 0.1%
 
196 1 0.1%
 
182 1 0.1%
 
Other values (8) 8 0.7%
 

Minimum 5 values

Value Count Frequency (%)  
0 1178 98.2%
 
23 1 0.1%
 
96 1 0.1%
 
130 1 0.1%
 
140 1 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
238 1 0.1%
 
245 1 0.1%
 
320 1 0.1%
 
407 1 0.1%
 
508 1 0.1%
 

Alley
Categorical

Distinct count 3
Unique (%) 0.2%
Missing (%) 93.8%
Missing (n) 1125
Grvl
 
41
Pave
 
34
(Missing)
1125
Value Count Frequency (%)  
Grvl 41 3.4%
 
Pave 34 2.8%
 
(Missing) 1125 93.8%
 

BedroomAbvGr
Numeric

Distinct count 8
Unique (%) 0.7%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.8575
Minimum 0
Maximum 8
Zeros (%) 0.3%

Quantile statistics

Minimum 0
5-th percentile 2
Q1 2
Median 3
Q3 3
95-th percentile 4
Maximum 8
Range 8
Interquartile range 1

Descriptive statistics

Standard deviation 0.8192
Coef of variation 0.28668
Kurtosis 2.3648
Mean 2.8575
MAD 0.58259
Skewness 0.27726
Sum 3429
Variance 0.67109
Memory size 9.5 KiB
Value Count Frequency (%)  
3 658 54.8%
 
2 299 24.9%
 
4 171 14.2%
 
1 44 3.7%
 
5 17 1.4%
 
6 6 0.5%
 
0 4 0.3%
 
8 1 0.1%
 

Minimum 5 values

Value Count Frequency (%)  
0 4 0.3%
 
1 44 3.7%
 
2 299 24.9%
 
3 658 54.8%
 
4 171 14.2%
 

Maximum 5 values

Value Count Frequency (%)  
3 658 54.8%
 
4 171 14.2%
 
5 17 1.4%
 
6 6 0.5%
 
8 1 0.1%
 

BldgType
Categorical

Distinct count 5
Unique (%) 0.4%
Missing (%) 0.0%
Missing (n) 0
1Fam
1001
TwnhsE
 
93
Duplex
 
41
Other values (2)
 
65
Value Count Frequency (%)  
1Fam 1001 83.4%
 
TwnhsE 93 7.8%
 
Duplex 41 3.4%
 
Twnhs 37 3.1%
 
2fmCon 28 2.3%
 

BsmtCond
Categorical

Distinct count 5
Unique (%) 0.4%
Missing (%) 2.7%
Missing (n) 32
TA
1076
Gd
 
53
Fa
 
37
(Missing)
 
32
Value Count Frequency (%)  
TA 1076 89.7%
 
Gd 53 4.4%
 
Fa 37 3.1%
 
Po 2 0.2%
 
(Missing) 32 2.7%
 

BsmtExposure
Categorical

Distinct count 5
Unique (%) 0.4%
Missing (%) 2.7%
Missing (n) 33
No
784
Av
175
Gd
 
113
Value Count Frequency (%)  
No 784 65.3%
 
Av 175 14.6%
 
Gd 113 9.4%
 
Mn 95 7.9%
 
(Missing) 33 2.8%
 

BsmtFinSF1
Numeric

Distinct count 567
Unique (%) 47.2%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 444.89
Minimum 0
Maximum 2260
Zeros (%) 31.3%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 385.5
Q3 712.25
95-th percentile 1270.2
Maximum 2260
Range 2260
Interquartile range 712.25

Descriptive statistics

Standard deviation 439.99
Coef of variation 0.98899
Kurtosis 0.075457
Mean 444.89
MAD 366.63
Skewness 0.80561
Sum 533864
Variance 193590
Memory size 9.5 KiB
Value Count Frequency (%)  
0 376 31.3%
 
24 12 1.0%
 
16 9 0.8%
 
20 5 0.4%
 
662 5 0.4%
 
300 4 0.3%
 
616 4 0.3%
 
588 4 0.3%
 
495 4 0.3%
 
442 4 0.3%
 
Other values (557) 773 64.4%
 

Minimum 5 values

Value Count Frequency (%)  
0 376 31.3%
 
2 1 0.1%
 
16 9 0.8%
 
20 5 0.4%
 
24 12 1.0%
 

Maximum 5 values

Value Count Frequency (%)  
1880 1 0.1%
 
1904 1 0.1%
 
2096 1 0.1%
 
2188 1 0.1%
 
2260 1 0.1%
 

BsmtFinSF2
Numeric

Distinct count 116
Unique (%) 9.7%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 45.26
Minimum 0
Maximum 1474
Zeros (%) 88.8%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 391.25
Maximum 1474
Range 1474
Interquartile range 0

Descriptive statistics

Standard deviation 158.93
Coef of variation 3.5115
Kurtosis 21.374
Mean 45.26
MAD 80.503
Skewness 4.35
Sum 54312
Variance 25259
Memory size 9.5 KiB
Value Count Frequency (%)  
0 1066 88.8%
 
180 5 0.4%
 
117 2 0.2%
 
182 2 0.2%
 
551 2 0.2%
 
480 2 0.2%
 
147 2 0.2%
 
468 2 0.2%
 
279 2 0.2%
 
287 2 0.2%
 
Other values (106) 113 9.4%
 

Minimum 5 values

Value Count Frequency (%)  
0 1066 88.8%
 
28 1 0.1%
 
32 1 0.1%
 
35 1 0.1%
 
40 1 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
1080 1 0.1%
 
1085 1 0.1%
 
1120 1 0.1%
 
1127 1 0.1%
 
1474 1 0.1%
 

BsmtFinType1
Categorical

Distinct count 7
Unique (%) 0.6%
Missing (%) 2.7%
Missing (n) 32
GLQ
346
Unf
344
ALQ
185
Other values (3)
293
Value Count Frequency (%)  
GLQ 346 28.8%
 
Unf 344 28.7%
 
ALQ 185 15.4%
 
BLQ 124 10.3%
 
Rec 108 9.0%
 
LwQ 61 5.1%
 
(Missing) 32 2.7%
 

BsmtFinType2
Categorical

Distinct count 7
Unique (%) 0.6%
Missing (%) 2.7%
Missing (n) 33
Unf
1034
Rec
 
38
LwQ
 
36
Other values (3)
 
59
(Missing)
 
33
Value Count Frequency (%)  
Unf 1034 86.2%
 
Rec 38 3.2%
 
LwQ 36 3.0%
 
BLQ 29 2.4%
 
ALQ 17 1.4%
 
GLQ 13 1.1%
 
(Missing) 33 2.8%
 

BsmtFullBath
Numeric

Distinct count 4
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.42167
Minimum 0
Maximum 3
Zeros (%) 59.1%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 1
95-th percentile 1
Maximum 3
Range 3
Interquartile range 1

Descriptive statistics

Standard deviation 0.52034
Coef of variation 1.234
Kurtosis -0.71336
Mean 0.42167
MAD 0.49827
Skewness 0.63666
Sum 506
Variance 0.27076
Memory size 9.5 KiB
Value Count Frequency (%)  
0 709 59.1%
 
1 477 39.8%
 
2 13 1.1%
 
3 1 0.1%
 

Minimum 5 values

Value Count Frequency (%)  
0 709 59.1%
 
1 477 39.8%
 
2 13 1.1%
 
3 1 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
0 709 59.1%
 
1 477 39.8%
 
2 13 1.1%
 
3 1 0.1%
 

BsmtHalfBath
Numeric

Distinct count 3
Unique (%) 0.2%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.061667
Minimum 0
Maximum 2
Zeros (%) 94.0%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 1
Maximum 2
Range 2
Interquartile range 0

Descriptive statistics

Standard deviation 0.24748
Coef of variation 4.0133
Kurtosis 15.43
Mean 0.061667
MAD 0.11593
Skewness 3.9756
Sum 74
Variance 0.061248
Memory size 9.5 KiB
Value Count Frequency (%)  
0 1128 94.0%
 
1 70 5.8%
 
2 2 0.2%
 

Minimum 5 values

Value Count Frequency (%)  
0 1128 94.0%
 
1 70 5.8%
 
2 2 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
0 1128 94.0%
 
1 70 5.8%
 
2 2 0.2%
 

BsmtQual
Categorical

Distinct count 5
Unique (%) 0.4%
Missing (%) 2.7%
Missing (n) 32
TA
526
Gd
511
Ex
 
102
(Missing)
 
32
Value Count Frequency (%)  
TA 526 43.8%
 
Gd 511 42.6%
 
Ex 102 8.5%
 
Fa 29 2.4%
 
(Missing) 32 2.7%
 

BsmtUnfSF
Numeric

Distinct count 691
Unique (%) 57.6%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 564.55
Minimum 0
Maximum 2336
Zeros (%) 8.2%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 225
Median 472
Q3 799.5
95-th percentile 1450.4
Maximum 2336
Range 2336
Interquartile range 574.5

Descriptive statistics

Standard deviation 440.39
Coef of variation 0.78007
Kurtosis 0.58593
Mean 564.55
MAD 350.12
Skewness 0.95092
Sum 677464
Variance 193940
Memory size 9.5 KiB
Value Count Frequency (%)  
0 99 8.2%
 
728 8 0.7%
 
572 7 0.6%
 
270 6 0.5%
 
280 6 0.5%
 
384 6 0.5%
 
625 6 0.5%
 
600 6 0.5%
 
440 5 0.4%
 
264 5 0.4%
 
Other values (681) 1046 87.2%
 

Minimum 5 values

Value Count Frequency (%)  
0 99 8.2%
 
15 1 0.1%
 
23 2 0.2%
 
26 1 0.1%
 
29 1 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
2042 1 0.1%
 
2046 1 0.1%
 
2121 1 0.1%
 
2153 1 0.1%
 
2336 1 0.1%
 

CentralAir
Categorical

Distinct count 2
Unique (%) 0.2%
Missing (%) 0.0%
Missing (n) 0
Y
1122
N
 
78
Value Count Frequency (%)  
Y 1122 93.5%
 
N 78 6.5%
 

Condition1
Categorical

Distinct count 9
Unique (%) 0.8%
Missing (%) 0.0%
Missing (n) 0
Norm
1035
Feedr
 
65
Artery
 
41
Other values (6)
 
59
Value Count Frequency (%)  
Norm 1035 86.2%
 
Feedr 65 5.4%
 
Artery 41 3.4%
 
RRAn 21 1.8%
 
PosN 17 1.4%
 
RRAe 8 0.7%
 
PosA 7 0.6%
 
RRNn 4 0.3%
 
RRNe 2 0.2%
 

Condition2
Categorical

Distinct count 7
Unique (%) 0.6%
Missing (%) 0.0%
Missing (n) 0
Norm
1186
Feedr
 
6
PosN
 
2
Other values (4)
 
6
Value Count Frequency (%)  
Norm 1186 98.8%
 
Feedr 6 0.5%
 
PosN 2 0.2%
 
Artery 2 0.2%
 
RRNn 2 0.2%
 
PosA 1 0.1%
 
RRAn 1 0.1%
 

Electrical
Categorical

Distinct count 5
Unique (%) 0.4%
Missing (%) 0.0%
Missing (n) 0
SBrkr
1095
FuseA
 
80
FuseF
 
21
Other values (2)
 
4
Value Count Frequency (%)  
SBrkr 1095 91.2%
 
FuseA 80 6.7%
 
FuseF 21 1.8%
 
FuseP 3 0.2%
 
Mix 1 0.1%
 

EnclosedPorch
Numeric

Distinct count 104
Unique (%) 8.7%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 22.178
Minimum 0
Maximum 552
Zeros (%) 85.5%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 183.05
Maximum 552
Range 552
Interquartile range 0

Descriptive statistics

Standard deviation 61.507
Coef of variation 2.7733
Kurtosis 10.979
Mean 22.178
MAD 37.93
Skewness 3.1286
Sum 26614
Variance 3783.2
Memory size 9.5 KiB
Value Count Frequency (%)  
0 1026 85.5%
 
112 12 1.0%
 
192 5 0.4%
 
120 5 0.4%
 
144 5 0.4%
 
96 5 0.4%
 
116 4 0.3%
 
156 4 0.3%
 
252 3 0.2%
 
228 3 0.2%
 
Other values (94) 128 10.7%
 

Minimum 5 values

Value Count Frequency (%)  
0 1026 85.5%
 
19 1 0.1%
 
24 1 0.1%
 
30 1 0.1%
 
32 2 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
294 1 0.1%
 
318 1 0.1%
 
330 1 0.1%
 
386 1 0.1%
 
552 1 0.1%
 

ExterCond
Categorical

Distinct count 5
Unique (%) 0.4%
Missing (%) 0.0%
Missing (n) 0
TA
1050
Gd
 
120
Fa
 
26
Other values (2)
 
4
Value Count Frequency (%)  
TA 1050 87.5%
 
Gd 120 10.0%
 
Fa 26 2.2%
 
Ex 3 0.2%
 
Po 1 0.1%
 

ExterQual
Categorical

Distinct count 4
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
TA
747
Gd
399
Ex
 
43
Value Count Frequency (%)  
TA 747 62.3%
 
Gd 399 33.2%
 
Ex 43 3.6%
 
Fa 11 0.9%
 

Exterior1st
Categorical

Distinct count 14
Unique (%) 1.2%
Missing (%) 0.0%
Missing (n) 0
VinylSd
417
HdBoard
187
MetalSd
183
Other values (11)
413
Value Count Frequency (%)  
VinylSd 417 34.8%
 
HdBoard 187 15.6%
 
MetalSd 183 15.2%
 
Wd Sdng 169 14.1%
 
Plywood 93 7.8%
 
CemntBd 51 4.2%
 
BrkFace 37 3.1%
 
WdShing 22 1.8%
 
Stucco 20 1.7%
 
AsbShng 15 1.2%
 
Other values (4) 6 0.5%
 

Exterior2nd
Categorical

Distinct count 15
Unique (%) 1.2%
Missing (%) 0.0%
Missing (n) 0
VinylSd
410
MetalSd
180
HdBoard
171
Other values (12)
439
Value Count Frequency (%)  
VinylSd 410 34.2%
 
MetalSd 180 15.0%
 
HdBoard 171 14.2%
 
Wd Sdng 160 13.3%
 
Plywood 122 10.2%
 
CmentBd 49 4.1%
 
Wd Shng 29 2.4%
 
Stucco 21 1.8%
 
BrkFace 20 1.7%
 
AsbShng 14 1.2%
 
Other values (5) 24 2.0%
 

Fence
Categorical

Distinct count 5
Unique (%) 0.4%
Missing (%) 81.1%
Missing (n) 973
MnPrv
 
130
GdPrv
 
50
GdWo
 
38
(Missing)
973
Value Count Frequency (%)  
MnPrv 130 10.8%
 
GdPrv 50 4.2%
 
GdWo 38 3.2%
 
MnWw 9 0.8%
 
(Missing) 973 81.1%
 

FireplaceQu
Categorical

Distinct count 6
Unique (%) 0.5%
Missing (%) 47.0%
Missing (n) 564
Gd
309
TA
261
Fa
 
29
Other values (2)
 
37
(Missing)
564
Value Count Frequency (%)  
Gd 309 25.8%
 
TA 261 21.8%
 
Fa 29 2.4%
 
Po 19 1.6%
 
Ex 18 1.5%
 
(Missing) 564 47.0%
 

Fireplaces
Numeric

Distinct count 4
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.61417
Minimum 0
Maximum 3
Zeros (%) 47.0%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 1
Q3 1
95-th percentile 2
Maximum 3
Range 3
Interquartile range 1

Descriptive statistics

Standard deviation 0.64211
Coef of variation 1.0455
Kurtosis -0.22256
Mean 0.61417
MAD 0.57732
Skewness 0.63788
Sum 737
Variance 0.41231
Memory size 9.5 KiB
Value Count Frequency (%)  
0 564 47.0%
 
1 539 44.9%
 
2 93 7.8%
 
3 4 0.3%
 

Minimum 5 values

Value Count Frequency (%)  
0 564 47.0%
 
1 539 44.9%
 
2 93 7.8%
 
3 4 0.3%
 

Maximum 5 values

Value Count Frequency (%)  
0 564 47.0%
 
1 539 44.9%
 
2 93 7.8%
 
3 4 0.3%
 

Foundation
Categorical

Distinct count 6
Unique (%) 0.5%
Missing (%) 0.0%
Missing (n) 0
PConc
534
CBlock
522
BrkTil
118
Other values (3)
 
26
Value Count Frequency (%)  
PConc 534 44.5%
 
CBlock 522 43.5%
 
BrkTil 118 9.8%
 
Slab 20 1.7%
 
Stone 4 0.3%
 
Wood 2 0.2%
 

FullBath
Numeric

Distinct count 4
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 1.5608
Minimum 0
Maximum 3
Zeros (%) 0.6%

Quantile statistics

Minimum 0
5-th percentile 1
Q1 1
Median 2
Q3 2
95-th percentile 2
Maximum 3
Range 3
Interquartile range 1

Descriptive statistics

Standard deviation 0.55217
Coef of variation 0.35376
Kurtosis -0.85784
Mean 1.5608
MAD 0.52389
Skewness 0.070791
Sum 1873
Variance 0.30489
Memory size 9.5 KiB
Value Count Frequency (%)  
2 624 52.0%
 
1 541 45.1%
 
3 28 2.3%
 
0 7 0.6%
 

Minimum 5 values

Value Count Frequency (%)  
0 7 0.6%
 
1 541 45.1%
 
2 624 52.0%
 
3 28 2.3%
 

Maximum 5 values

Value Count Frequency (%)  
0 7 0.6%
 
1 541 45.1%
 
2 624 52.0%
 
3 28 2.3%
 

Functional
Categorical

Distinct count 7
Unique (%) 0.6%
Missing (%) 0.0%
Missing (n) 0
Typ
1117
Min2
 
28
Min1
 
25
Other values (4)
 
30
Value Count Frequency (%)  
Typ 1117 93.1%
 
Min2 28 2.3%
 
Min1 25 2.1%
 
Maj1 12 1.0%
 
Mod 12 1.0%
 
Maj2 5 0.4%
 
Sev 1 0.1%
 

GarageArea
Numeric

Distinct count 402
Unique (%) 33.5%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 472.6
Minimum 0
Maximum 1390
Zeros (%) 5.6%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 334.5
Median 478
Q3 576
95-th percentile 852.05
Maximum 1390
Range 1390
Interquartile range 241.5

Descriptive statistics

Standard deviation 212.72
Coef of variation 0.45011
Kurtosis 0.82872
Mean 472.6
MAD 159.76
Skewness 0.14444
Sum 567125
Variance 45251
Memory size 9.5 KiB
Value Count Frequency (%)  
0 67 5.6%
 
440 43 3.6%
 
576 41 3.4%
 
240 33 2.8%
 
484 28 2.3%
 
528 24 2.0%
 
400 20 1.7%
 
480 20 1.7%
 
288 19 1.6%
 
264 19 1.6%
 
Other values (392) 886 73.8%
 

Minimum 5 values

Value Count Frequency (%)  
0 67 5.6%
 
160 1 0.1%
 
164 1 0.1%
 
180 7 0.6%
 
186 1 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
1166 1 0.1%
 
1220 1 0.1%
 
1248 1 0.1%
 
1356 1 0.1%
 
1390 1 0.1%
 

GarageCars
Numeric

Distinct count 5
Unique (%) 0.4%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 1.7633
Minimum 0
Maximum 4
Zeros (%) 5.6%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 1
Median 2
Q3 2
95-th percentile 3
Maximum 4
Range 4
Interquartile range 1

Descriptive statistics

Standard deviation 0.74349
Coef of variation 0.42164
Kurtosis 0.20649
Mean 1.7633
MAD 0.58239
Skewness -0.36896
Sum 2116
Variance 0.55278
Memory size 9.5 KiB
Value Count Frequency (%)  
2 680 56.7%
 
1 303 25.2%
 
3 147 12.2%
 
0 67 5.6%
 
4 3 0.2%
 

Minimum 5 values

Value Count Frequency (%)  
0 67 5.6%
 
1 303 25.2%
 
2 680 56.7%
 
3 147 12.2%
 
4 3 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
0 67 5.6%
 
1 303 25.2%
 
2 680 56.7%
 
3 147 12.2%
 
4 3 0.2%
 

GarageCond
Categorical

Distinct count 6
Unique (%) 0.5%
Missing (%) 5.6%
Missing (n) 67
TA
1093
Fa
 
26
Gd
 
6
Other values (2)
 
8
(Missing)
 
67
Value Count Frequency (%)  
TA 1093 91.1%
 
Fa 26 2.2%
 
Gd 6 0.5%
 
Po 6 0.5%
 
Ex 2 0.2%
 
(Missing) 67 5.6%
 

GarageFinish
Categorical

Distinct count 4
Unique (%) 0.3%
Missing (%) 5.6%
Missing (n) 67
Unf
500
RFn
338
Fin
295
(Missing)
 
67
Value Count Frequency (%)  
Unf 500 41.7%
 
RFn 338 28.2%
 
Fin 295 24.6%
 
(Missing) 67 5.6%
 

GarageQual
Categorical

Distinct count 6
Unique (%) 0.5%
Missing (%) 5.6%
Missing (n) 67
TA
1081
Fa
 
37
Gd
 
9
Other values (2)
 
6
(Missing)
 
67
Value Count Frequency (%)  
TA 1081 90.1%
 
Fa 37 3.1%
 
Gd 9 0.8%
 
Ex 3 0.2%
 
Po 3 0.2%
 
(Missing) 67 5.6%
 

GarageType
Categorical

Distinct count 7
Unique (%) 0.6%
Missing (%) 5.6%
Missing (n) 67
Attchd
718
Detchd
317
BuiltIn
 
70
Other values (3)
 
28
(Missing)
 
67
Value Count Frequency (%)  
Attchd 718 59.8%
 
Detchd 317 26.4%
 
BuiltIn 70 5.8%
 
Basment 14 1.2%
 
CarPort 8 0.7%
 
2Types 6 0.5%
 
(Missing) 67 5.6%
 

GarageYrBlt
Numeric

Distinct count 96
Unique (%) 8.0%
Missing (%) 5.6%
Missing (n) 67
Infinite (%) 0.0%
Infinite (n) 0
Mean 1978.4
Minimum 1900
Maximum 2010
Zeros (%) 0.0%

Quantile statistics

Minimum 1900
5-th percentile 1930
Q1 1961
Median 1980
Q3 2002
95-th percentile 2007
Maximum 2010
Range 110
Interquartile range 41

Descriptive statistics

Standard deviation 24.813
Coef of variation 0.012542
Kurtosis -0.41703
Mean 1978.4
MAD 20.99
Skewness -0.6562
Sum 2241500
Variance 615.68
Memory size 9.5 KiB
Value Count Frequency (%)  
2005.0 51 4.2%
 
2006.0 47 3.9%
 
2004.0 43 3.6%
 
2007.0 40 3.3%
 
2003.0 40 3.3%
 
1977.0 28 2.3%
 
1976.0 25 2.1%
 
2008.0 23 1.9%
 
1999.0 23 1.9%
 
1998.0 23 1.9%
 
Other values (85) 790 65.8%
 
(Missing) 67 5.6%
 

Minimum 5 values

Value Count Frequency (%)  
1900.0 1 0.1%
 
1906.0 1 0.1%
 
1908.0 1 0.1%
 
1910.0 3 0.2%
 
1914.0 1 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
2006.0 47 3.9%
 
2007.0 40 3.3%
 
2008.0 23 1.9%
 
2009.0 19 1.6%
 
2010.0 3 0.2%
 

GrLivArea
Numeric

Distinct count 761
Unique (%) 63.4%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 1509
Minimum 334
Maximum 4676
Zeros (%) 0.0%

Quantile statistics

Minimum 334
5-th percentile 848
Q1 1125.8
Median 1456
Q3 1764.5
95-th percentile 2452.5
Maximum 4676
Range 4342
Interquartile range 638.75

Descriptive statistics

Standard deviation 517.38
Coef of variation 0.34287
Kurtosis 3.2148
Mean 1509
MAD 393.11
Skewness 1.2105
Sum 1810773
Variance 267680
Memory size 9.5 KiB
Value Count Frequency (%)  
864 16 1.3%
 
1040 12 1.0%
 
894 10 0.8%
 
1456 9 0.8%
 
1200 8 0.7%
 
1092 8 0.7%
 
848 7 0.6%
 
1728 7 0.6%
 
987 6 0.5%
 
1344 6 0.5%
 
Other values (751) 1111 92.6%
 

Minimum 5 values

Value Count Frequency (%)  
334 1 0.1%
 
438 1 0.1%
 
480 1 0.1%
 
520 1 0.1%
 
605 1 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
3608 1 0.1%
 
3627 1 0.1%
 
4316 1 0.1%
 
4476 1 0.1%
 
4676 1 0.1%
 

HalfBath
Numeric

Distinct count 3
Unique (%) 0.2%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.3825
Minimum 0
Maximum 2
Zeros (%) 62.4%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 1
95-th percentile 1
Maximum 2
Range 2
Interquartile range 1

Descriptive statistics

Standard deviation 0.49974
Coef of variation 1.3065
Kurtosis -1.1904
Mean 0.3825
MAD 0.47749
Skewness 0.64427
Sum 459
Variance 0.24974
Memory size 9.5 KiB
Value Count Frequency (%)  
0 749 62.4%
 
1 443 36.9%
 
2 8 0.7%
 

Minimum 5 values

Value Count Frequency (%)  
0 749 62.4%
 
1 443 36.9%
 
2 8 0.7%
 

Maximum 5 values

Value Count Frequency (%)  
0 749 62.4%
 
1 443 36.9%
 
2 8 0.7%
 

Heating
Categorical

Distinct count 4
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
GasA
1177
GasW
 
15
Grav
 
5
Value Count Frequency (%)  
GasA 1177 98.1%
 
GasW 15 1.2%
 
Grav 5 0.4%
 
Wall 3 0.2%
 

HeatingQC
Categorical

Distinct count 5
Unique (%) 0.4%
Missing (%) 0.0%
Missing (n) 0
Ex
603
TA
357
Gd
203
Other values (2)
 
37
Value Count Frequency (%)  
Ex 603 50.2%
 
TA 357 29.8%
 
Gd 203 16.9%
 
Fa 36 3.0%
 
Po 1 0.1%
 

HouseStyle
Categorical

Distinct count 8
Unique (%) 0.7%
Missing (%) 0.0%
Missing (n) 0
1Story
601
2Story
372
1.5Fin
 
120
Other values (5)
 
107
Value Count Frequency (%)  
1Story 601 50.1%
 
2Story 372 31.0%
 
1.5Fin 120 10.0%
 
SLvl 48 4.0%
 
SFoyer 30 2.5%
 
1.5Unf 13 1.1%
 
2.5Unf 9 0.8%
 
2.5Fin 7 0.6%
 

Id
Numeric

Distinct count 1200
Unique (%) 100.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 600.5
Minimum 1
Maximum 1200
Zeros (%) 0.0%

Quantile statistics

Minimum 1
5-th percentile 60.95
Q1 300.75
Median 600.5
Q3 900.25
95-th percentile 1140
Maximum 1200
Range 1199
Interquartile range 599.5

Descriptive statistics

Standard deviation 346.55
Coef of variation 0.57711
Kurtosis -1.2
Mean 600.5
MAD 300
Skewness 0
Sum 720600
Variance 120100
Memory size 9.5 KiB
Value Count Frequency (%)  
1200 1 0.1%
 
394 1 0.1%
 
396 1 0.1%
 
397 1 0.1%
 
398 1 0.1%
 
399 1 0.1%
 
400 1 0.1%
 
401 1 0.1%
 
402 1 0.1%
 
403 1 0.1%
 
Other values (1190) 1190 99.2%
 

Minimum 5 values

Value Count Frequency (%)  
1 1 0.1%
 
2 1 0.1%
 
3 1 0.1%
 
4 1 0.1%
 
5 1 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
1196 1 0.1%
 
1197 1 0.1%
 
1198 1 0.1%
 
1199 1 0.1%
 
1200 1 0.1%
 

KitchenAbvGr
Numeric

Distinct count 4
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 1.045
Minimum 0
Maximum 3
Zeros (%) 0.1%

Quantile statistics

Minimum 0
5-th percentile 1
Q1 1
Median 1
Q3 1
95-th percentile 1
Maximum 3
Range 3
Interquartile range 0

Descriptive statistics

Standard deviation 0.21912
Coef of variation 0.20969
Kurtosis 23.474
Mean 1.045
MAD 0.087692
Skewness 4.6148
Sum 1254
Variance 0.048015
Memory size 9.5 KiB
Value Count Frequency (%)  
1 1146 95.5%
 
2 51 4.2%
 
3 2 0.2%
 
0 1 0.1%
 

Minimum 5 values

Value Count Frequency (%)  
0 1 0.1%
 
1 1146 95.5%
 
2 51 4.2%
 
3 2 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
0 1 0.1%
 
1 1146 95.5%
 
2 51 4.2%
 
3 2 0.2%
 

KitchenQual
Categorical

Distinct count 4
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
TA
602
Gd
482
Ex
 
81
Value Count Frequency (%)  
TA 602 50.2%
 
Gd 482 40.2%
 
Ex 81 6.8%
 
Fa 35 2.9%
 

LandContour
Categorical

Distinct count 4
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Lvl
1079
Bnk
 
49
HLS
 
42
Value Count Frequency (%)  
Lvl 1079 89.9%
 
Bnk 49 4.1%
 
HLS 42 3.5%
 
Low 30 2.5%
 

LandSlope
Categorical

Distinct count 3
Unique (%) 0.2%
Missing (%) 0.0%
Missing (n) 0
Gtl
1135
Mod
 
54
Sev
 
11
Value Count Frequency (%)  
Gtl 1135 94.6%
 
Mod 54 4.5%
 
Sev 11 0.9%
 

LotArea
Numeric

Distinct count 913
Unique (%) 76.1%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 10559
Minimum 1300
Maximum 215245
Zeros (%) 0.0%

Quantile statistics

Minimum 1300
5-th percentile 3196
Q1 7560
Median 9434.5
Q3 11616
95-th percentile 17105
Maximum 215245
Range 213945
Interquartile range 4056

Descriptive statistics

Standard deviation 10619
Coef of variation 1.0057
Kurtosis 191.96
Mean 10559
MAD 3823.5
Skewness 12.133
Sum 12671294
Variance 112770000
Memory size 9.5 KiB
Value Count Frequency (%)  
9600 20 1.7%
 
7200 19 1.6%
 
10800 12 1.0%
 
9000 12 1.0%
 
6000 11 0.9%
 
8400 10 0.8%
 
1680 8 0.7%
 
9100 7 0.6%
 
6120 7 0.6%
 
3182 7 0.6%
 
Other values (903) 1087 90.6%
 

Minimum 5 values

Value Count Frequency (%)  
1300 1 0.1%
 
1477 1 0.1%
 
1491 1 0.1%
 
1526 1 0.1%
 
1533 1 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
70761 1 0.1%
 
115149 1 0.1%
 
159000 1 0.1%
 
164660 1 0.1%
 
215245 1 0.1%
 

LotConfig
Categorical

Distinct count 5
Unique (%) 0.4%
Missing (%) 0.0%
Missing (n) 0
Inside
862
Corner
223
CulDSac
 
75
Other values (2)
 
40
Value Count Frequency (%)  
Inside 862 71.8%
 
Corner 223 18.6%
 
CulDSac 75 6.2%
 
FR2 38 3.2%
 
FR3 2 0.2%
 

LotFrontage
Numeric

Distinct count 107
Unique (%) 8.9%
Missing (%) 17.5%
Missing (n) 210
Infinite (%) 0.0%
Infinite (n) 0
Mean 70.087
Minimum 21
Maximum 313
Zeros (%) 0.0%

Quantile statistics

Minimum 21
5-th percentile 34
Q1 59
Median 70
Q3 80
95-th percentile 107
Maximum 313
Range 292
Interquartile range 21

Descriptive statistics

Standard deviation 23.702
Coef of variation 0.33818
Kurtosis 12.508
Mean 70.087
MAD 16.72
Skewness 1.6541
Sum 69386
Variance 561.79
Memory size 9.5 KiB
Value Count Frequency (%)  
60.0 112 9.3%
 
80.0 57 4.8%
 
70.0 55 4.6%
 
50.0 47 3.9%
 
75.0 46 3.8%
 
65.0 38 3.2%
 
85.0 32 2.7%
 
78.0 19 1.6%
 
21.0 19 1.6%
 
90.0 18 1.5%
 
Other values (96) 547 45.6%
 
(Missing) 210 17.5%
 

Minimum 5 values

Value Count Frequency (%)  
21.0 19 1.6%
 
24.0 18 1.5%
 
30.0 5 0.4%
 
32.0 4 0.3%
 
33.0 1 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
160.0 1 0.1%
 
168.0 1 0.1%
 
174.0 2 0.2%
 
182.0 1 0.1%
 
313.0 1 0.1%
 

LotShape
Categorical

Distinct count 4
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Reg
754
IR1
403
IR2
 
37
Value Count Frequency (%)  
Reg 754 62.8%
 
IR1 403 33.6%
 
IR2 37 3.1%
 
IR3 6 0.5%
 

LowQualFinSF
Numeric

Distinct count 22
Unique (%) 1.8%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 6.5533
Minimum 0
Maximum 572
Zeros (%) 98.1%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 0
Maximum 572
Range 572
Interquartile range 0

Descriptive statistics

Standard deviation 52.078
Coef of variation 7.9468
Kurtosis 73.829
Mean 6.5533
MAD 12.855
Skewness 8.5177
Sum 7864
Variance 2712.1
Memory size 9.5 KiB
Value Count Frequency (%)  
0 1177 98.1%
 
80 2 0.2%
 
360 2 0.2%
 
528 1 0.1%
 
53 1 0.1%
 
120 1 0.1%
 
144 1 0.1%
 
156 1 0.1%
 
232 1 0.1%
 
234 1 0.1%
 
Other values (12) 12 1.0%
 

Minimum 5 values

Value Count Frequency (%)  
0 1177 98.1%
 
53 1 0.1%
 
80 2 0.2%
 
120 1 0.1%
 
144 1 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
513 1 0.1%
 
514 1 0.1%
 
515 1 0.1%
 
528 1 0.1%
 
572 1 0.1%
 

MSSubClass
Numeric

Distinct count 15
Unique (%) 1.2%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 57.075
Minimum 20
Maximum 190
Zeros (%) 0.0%

Quantile statistics

Minimum 20
5-th percentile 20
Q1 20
Median 50
Q3 70
95-th percentile 160
Maximum 190
Range 170
Interquartile range 50

Descriptive statistics

Standard deviation 42.682
Coef of variation 0.74782
Kurtosis 1.5534
Mean 57.075
MAD 31.486
Skewness 1.4096
Sum 68490
Variance 1821.8
Memory size 9.5 KiB
Value Count Frequency (%)  
20 442 36.8%
 
60 253 21.1%
 
50 112 9.3%
 
120 73 6.1%
 
30 56 4.7%
 
160 53 4.4%
 
70 47 3.9%
 
80 44 3.7%
 
90 41 3.4%
 
190 27 2.2%
 
Other values (5) 52 4.3%
 

Minimum 5 values

Value Count Frequency (%)  
20 442 36.8%
 
30 56 4.7%
 
40 3 0.2%
 
45 12 1.0%
 
50 112 9.3%
 

Maximum 5 values

Value Count Frequency (%)  
90 41 3.4%
 
120 73 6.1%
 
160 53 4.4%
 
180 7 0.6%
 
190 27 2.2%
 

MSZoning
Categorical

Distinct count 5
Unique (%) 0.4%
Missing (%) 0.0%
Missing (n) 0
RL
946
RM
 
178
FV
 
55
Other values (2)
 
21
Value Count Frequency (%)  
RL 946 78.8%
 
RM 178 14.8%
 
FV 55 4.6%
 
RH 12 1.0%
 
C (all) 9 0.8%
 

MasVnrArea
Numeric

Distinct count 284
Unique (%) 23.7%
Missing (%) 0.5%
Missing (n) 6
Infinite (%) 0.0%
Infinite (n) 0
Mean 103.96
Minimum 0
Maximum 1600
Zeros (%) 59.2%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 166.75
95-th percentile 456
Maximum 1600
Range 1600
Interquartile range 166.75

Descriptive statistics

Standard deviation 183.53
Coef of variation 1.7654
Kurtosis 11.005
Mean 103.96
MAD 130.54
Skewness 2.7772
Sum 124130
Variance 33685
Memory size 9.5 KiB
Value Count Frequency (%)  
0.0 710 59.2%
 
108.0 8 0.7%
 
72.0 7 0.6%
 
180.0 6 0.5%
 
200.0 6 0.5%
 
16.0 6 0.5%
 
340.0 6 0.5%
 
120.0 5 0.4%
 
196.0 4 0.3%
 
183.0 4 0.3%
 
Other values (273) 432 36.0%
 
(Missing) 6 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 710 59.2%
 
1.0 1 0.1%
 
11.0 1 0.1%
 
14.0 1 0.1%
 
16.0 6 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
1115.0 1 0.1%
 
1129.0 1 0.1%
 
1170.0 1 0.1%
 
1378.0 1 0.1%
 
1600.0 1 0.1%
 

MasVnrType
Categorical

Distinct count 5
Unique (%) 0.4%
Missing (%) 0.5%
Missing (n) 6
None
711
BrkFace
369
Stone
 
100
Value Count Frequency (%)  
None 711 59.2%
 
BrkFace 369 30.8%
 
Stone 100 8.3%
 
BrkCmn 14 1.2%
 
(Missing) 6 0.5%
 

MiscFeature
Categorical

Distinct count 4
Unique (%) 0.3%
Missing (%) 96.1%
Missing (n) 1153
Shed
 
44
Othr
 
2
Gar2
 
1
(Missing)
1153
Value Count Frequency (%)  
Shed 44 3.7%
 
Othr 2 0.2%
 
Gar2 1 0.1%
 
(Missing) 1153 96.1%
 

MiscVal
Numeric

Distinct count 18
Unique (%) 1.5%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 40.453
Minimum 0
Maximum 15500
Zeros (%) 96.2%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 0
Maximum 15500
Range 15500
Interquartile range 0

Descriptive statistics

Standard deviation 482.32
Coef of variation 11.923
Kurtosis 884.77
Mean 40.453
MAD 77.805
Skewness 28.075
Sum 48544
Variance 232640
Memory size 9.5 KiB
Value Count Frequency (%)  
0 1154 96.2%
 
400 10 0.8%
 
500 8 0.7%
 
700 5 0.4%
 
450 4 0.3%
 
2000 3 0.2%
 
600 3 0.2%
 
1200 2 0.2%
 
480 2 0.2%
 
800 1 0.1%
 
Other values (8) 8 0.7%
 

Minimum 5 values

Value Count Frequency (%)  
0 1154 96.2%
 
54 1 0.1%
 
350 1 0.1%
 
400 10 0.8%
 
450 4 0.3%
 

Maximum 5 values

Value Count Frequency (%)  
1300 1 0.1%
 
1400 1 0.1%
 
2000 3 0.2%
 
3500 1 0.1%
 
15500 1 0.1%
 

MoSold
Numeric

Distinct count 12
Unique (%) 1.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 6.3117
Minimum 1
Maximum 12
Zeros (%) 0.0%

Quantile statistics

Minimum 1
5-th percentile 2
Q1 5
Median 6
Q3 8
95-th percentile 11
Maximum 12
Range 11
Interquartile range 3

Descriptive statistics

Standard deviation 2.6731
Coef of variation 0.42352
Kurtosis -0.35847
Mean 6.3117
MAD 2.1071
Skewness 0.20159
Sum 7574
Variance 7.1455
Memory size 9.5 KiB
Value Count Frequency (%)  
6 214 17.8%
 
7 203 16.9%
 
5 159 13.2%
 
4 115 9.6%
 
8 102 8.5%
 
3 86 7.2%
 
10 70 5.8%
 
11 63 5.2%
 
9 51 4.2%
 
12 46 3.8%
 
Other values (2) 91 7.6%
 

Minimum 5 values

Value Count Frequency (%)  
1 46 3.8%
 
2 45 3.8%
 
3 86 7.2%
 
4 115 9.6%
 
5 159 13.2%
 

Maximum 5 values

Value Count Frequency (%)  
8 102 8.5%
 
9 51 4.2%
 
10 70 5.8%
 
11 63 5.2%
 
12 46 3.8%
 

Neighborhood
Categorical

Distinct count 25
Unique (%) 2.1%
Missing (%) 0.0%
Missing (n) 0
NAmes
179
CollgCr
 
120
OldTown
 
95
Other values (22)
806
Value Count Frequency (%)  
NAmes 179 14.9%
 
CollgCr 120 10.0%
 
OldTown 95 7.9%
 
Edwards 81 6.8%
 
Somerst 68 5.7%
 
NridgHt 68 5.7%
 
Gilbert 67 5.6%
 
Sawyer 66 5.5%
 
NWAmes 55 4.6%
 
SawyerW 51 4.2%
 
Other values (15) 350 29.2%
 

OpenPorchSF
Numeric

Distinct count 187
Unique (%) 15.6%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 46.017
Minimum 0
Maximum 523
Zeros (%) 46.0%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 24
Q3 68
95-th percentile 172
Maximum 523
Range 523
Interquartile range 68

Descriptive statistics

Standard deviation 65.678
Coef of variation 1.4273
Kurtosis 7.767
Mean 46.017
MAD 47.583
Skewness 2.2985
Sum 55220
Variance 4313.6
Memory size 9.5 KiB
Value Count Frequency (%)  
0 552 46.0%
 
48 21 1.8%
 
20 17 1.4%
 
36 17 1.4%
 
40 15 1.2%
 
30 14 1.2%
 
45 14 1.2%
 
24 14 1.2%
 
50 12 1.0%
 
60 12 1.0%
 
Other values (177) 512 42.7%
 

Minimum 5 values

Value Count Frequency (%)  
0 552 46.0%
 
4 1 0.1%
 
8 1 0.1%
 
10 1 0.1%
 
11 1 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
364 1 0.1%
 
406 1 0.1%
 
418 1 0.1%
 
502 1 0.1%
 
523 1 0.1%
 

OverallCond
Numeric

Distinct count 9
Unique (%) 0.8%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 5.5683
Minimum 1
Maximum 9
Zeros (%) 0.0%

Quantile statistics

Minimum 1
5-th percentile 4
Q1 5
Median 5
Q3 6
95-th percentile 8
Maximum 9
Range 8
Interquartile range 1

Descriptive statistics

Standard deviation 1.1201
Coef of variation 0.20116
Kurtosis 1.0861
Mean 5.5683
MAD 0.8928
Skewness 0.62932
Sum 6682
Variance 1.2547
Memory size 9.5 KiB
Value Count Frequency (%)  
5 674 56.2%
 
6 205 17.1%
 
7 167 13.9%
 
8 63 5.2%
 
4 47 3.9%
 
3 22 1.8%
 
9 16 1.3%
 
2 5 0.4%
 
1 1 0.1%
 

Minimum 5 values

Value Count Frequency (%)  
1 1 0.1%
 
2 5 0.4%
 
3 22 1.8%
 
4 47 3.9%
 
5 674 56.2%
 

Maximum 5 values

Value Count Frequency (%)  
5 674 56.2%
 
6 205 17.1%
 
7 167 13.9%
 
8 63 5.2%
 
9 16 1.3%
 

OverallQual
Numeric

Distinct count 10
Unique (%) 0.8%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 6.105
Minimum 1
Maximum 10
Zeros (%) 0.0%

Quantile statistics

Minimum 1
5-th percentile 4
Q1 5
Median 6
Q3 7
95-th percentile 8
Maximum 10
Range 9
Interquartile range 2

Descriptive statistics

Standard deviation 1.3834
Coef of variation 0.22661
Kurtosis 0.13771
Mean 6.105
MAD 1.0991
Skewness 0.19176
Sum 7326
Variance 1.9139
Memory size 9.5 KiB
Value Count Frequency (%)  
5 329 27.4%
 
6 306 25.5%
 
7 264 22.0%
 
8 138 11.5%
 
4 91 7.6%
 
9 37 3.1%
 
3 16 1.3%
 
10 14 1.2%
 
2 3 0.2%
 
1 2 0.2%
 

Minimum 5 values

Value Count Frequency (%)  
1 2 0.2%
 
2 3 0.2%
 
3 16 1.3%
 
4 91 7.6%
 
5 329 27.4%
 

Maximum 5 values

Value Count Frequency (%)  
6 306 25.5%
 
7 264 22.0%
 
8 138 11.5%
 
9 37 3.1%
 
10 14 1.2%
 

PavedDrive
Categorical

Distinct count 3
Unique (%) 0.2%
Missing (%) 0.0%
Missing (n) 0
Y
1107
N
 
69
P
 
24
Value Count Frequency (%)  
Y 1107 92.2%
 
N 69 5.8%
 
P 24 2.0%
 

PoolArea
Numeric

Distinct count 5
Unique (%) 0.4%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 1.9092
Minimum 0
Maximum 648
Zeros (%) 99.7%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 0
Maximum 648
Range 648
Interquartile range 0

Descriptive statistics

Standard deviation 33.148
Coef of variation 17.363
Kurtosis 305.34
Mean 1.9092
MAD 3.8056
Skewness 17.45
Sum 2291
Variance 1098.8
Memory size 9.5 KiB
Value Count Frequency (%)  
0 1196 99.7%
 
648 1 0.1%
 
576 1 0.1%
 
555 1 0.1%
 
512 1 0.1%
 

Minimum 5 values

Value Count Frequency (%)  
0 1196 99.7%
 
512 1 0.1%
 
555 1 0.1%
 
576 1 0.1%
 
648 1 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
0 1196 99.7%
 
512 1 0.1%
 
555 1 0.1%
 
576 1 0.1%
 
648 1 0.1%
 

PoolQC
Categorical

Distinct count 4
Unique (%) 0.3%
Missing (%) 99.7%
Missing (n) 1196
Ex
 
2
Fa
 
1
Gd
 
1
(Missing)
1196
Value Count Frequency (%)  
Ex 2 0.2%
 
Fa 1 0.1%
 
Gd 1 0.1%
 
(Missing) 1196 99.7%
 

RoofMatl
Categorical

Distinct count 6
Unique (%) 0.5%
Missing (%) 0.0%
Missing (n) 0
CompShg
1178
Tar&Grv
 
10
WdShngl
 
6
Other values (3)
 
6
Value Count Frequency (%)  
CompShg 1178 98.2%
 
Tar&Grv 10 0.8%
 
WdShngl 6 0.5%
 
WdShake 4 0.3%
 
Metal 1 0.1%
 
Membran 1 0.1%
 

RoofStyle
Categorical

Distinct count 5
Unique (%) 0.4%
Missing (%) 0.0%
Missing (n) 0
Gable
945
Hip
228
Flat
 
12
Other values (2)
 
15
Value Count Frequency (%)  
Gable 945 78.8%
 
Hip 228 19.0%
 
Flat 12 1.0%
 
Gambrel 9 0.8%
 
Mansard 6 0.5%
 

SaleCondition
Categorical

Distinct count 6
Unique (%) 0.5%
Missing (%) 0.0%
Missing (n) 0
Normal
979
Partial
 
104
Abnorml
 
85
Other values (3)
 
32
Value Count Frequency (%)  
Normal 979 81.6%
 
Partial 104 8.7%
 
Abnorml 85 7.1%
 
Family 17 1.4%
 
Alloca 11 0.9%
 
AdjLand 4 0.3%
 

SalePrice
Numeric

Distinct count 596
Unique (%) 49.7%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 181410
Minimum 34900
Maximum 755000
Zeros (%) 0.0%

Quantile statistics

Minimum 34900
5-th percentile 87975
Q1 129900
Median 163700
Q3 214000
95-th percentile 329200
Maximum 755000
Range 720100
Interquartile range 84100

Descriptive statistics

Standard deviation 81071
Coef of variation 0.44688
Kurtosis 7.0339
Mean 181410
MAD 58084
Skewness 1.9672
Sum 217697554
Variance 6572500000
Memory size 9.5 KiB
Value Count Frequency (%)  
140000 16 1.3%
 
135000 16 1.3%
 
110000 12 1.0%
 
155000 12 1.0%
 
145000 11 0.9%
 
115000 10 0.8%
 
185000 9 0.8%
 
160000 9 0.8%
 
190000 9 0.8%
 
139000 9 0.8%
 
Other values (586) 1087 90.6%
 

Minimum 5 values

Value Count Frequency (%)  
34900 1 0.1%
 
35311 1 0.1%
 
37900 1 0.1%
 
39300 1 0.1%
 
40000 1 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
582933 1 0.1%
 
611657 1 0.1%
 
625000 1 0.1%
 
745000 1 0.1%
 
755000 1 0.1%
 

SaleType
Categorical

Distinct count 9
Unique (%) 0.8%
Missing (%) 0.0%
Missing (n) 0
WD
1036
New
 
101
COD
 
37
Other values (6)
 
26
Value Count Frequency (%)  
WD 1036 86.3%
 
New 101 8.4%
 
COD 37 3.1%
 
ConLD 9 0.8%
 
ConLI 5 0.4%
 
ConLw 5 0.4%
 
CWD 3 0.2%
 
Con 2 0.2%
 
Oth 2 0.2%
 

ScreenPorch
Numeric

Distinct count 65
Unique (%) 5.4%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 14.981
Minimum 0
Maximum 410
Zeros (%) 92.1%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 160.05
Maximum 410
Range 410
Interquartile range 0

Descriptive statistics

Standard deviation 54.768
Coef of variation 3.6559
Kurtosis 15.843
Mean 14.981
MAD 27.59
Skewness 3.9209
Sum 17977
Variance 2999.5
Memory size 9.5 KiB
Value Count Frequency (%)  
0 1105 92.1%
 
192 6 0.5%
 
189 4 0.3%
 
180 4 0.3%
 
120 3 0.2%
 
126 3 0.2%
 
224 3 0.2%
 
160 3 0.2%
 
144 3 0.2%
 
200 2 0.2%
 
Other values (55) 64 5.3%
 

Minimum 5 values

Value Count Frequency (%)  
0 1105 92.1%
 
53 1 0.1%
 
60 1 0.1%
 
63 1 0.1%
 
90 2 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
322 1 0.1%
 
374 1 0.1%
 
385 1 0.1%
 
396 1 0.1%
 
410 1 0.1%
 

Street
Categorical

Distinct count 2
Unique (%) 0.2%
Missing (%) 0.0%
Missing (n) 0
Pave
1194
Grvl
 
6
Value Count Frequency (%)  
Pave 1194 99.5%
 
Grvl 6 0.5%
 

TotRmsAbvGrd
Numeric

Distinct count 12
Unique (%) 1.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 6.4942
Minimum 2
Maximum 14
Zeros (%) 0.0%

Quantile statistics

Minimum 2
5-th percentile 4
Q1 5
Median 6
Q3 7
95-th percentile 10
Maximum 14
Range 12
Interquartile range 2

Descriptive statistics

Standard deviation 1.6147
Coef of variation 0.24863
Kurtosis 0.85135
Mean 6.4942
MAD 1.2736
Skewness 0.68624
Sum 7793
Variance 2.6071
Memory size 9.5 KiB
Value Count Frequency (%)  
6 332 27.7%
 
7 269 22.4%
 
5 231 19.2%
 
8 146 12.2%
 
4 85 7.1%
 
9 62 5.2%
 
10 41 3.4%
 
11 14 1.2%
 
3 11 0.9%
 
12 7 0.6%
 
Other values (2) 2 0.2%
 

Minimum 5 values

Value Count Frequency (%)  
2 1 0.1%
 
3 11 0.9%
 
4 85 7.1%
 
5 231 19.2%
 
6 332 27.7%
 

Maximum 5 values

Value Count Frequency (%)  
9 62 5.2%
 
10 41 3.4%
 
11 14 1.2%
 
12 7 0.6%
 
14 1 0.1%
 

TotalBsmtSF
Numeric

Distinct count 642
Unique (%) 53.5%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 1054.7
Minimum 0
Maximum 3206
Zeros (%) 2.7%

Quantile statistics

Minimum 0
5-th percentile 483
Q1 796
Median 1002.5
Q3 1298.2
95-th percentile 1734.4
Maximum 3206
Range 3206
Interquartile range 502.25

Descriptive statistics

Standard deviation 420
Coef of variation 0.39821
Kurtosis 2.2581
Mean 1054.7
MAD 316.95
Skewness 0.56526
Sum 1265640
Variance 176400
Memory size 9.5 KiB
Value Count Frequency (%)  
0 32 2.7%
 
864 25 2.1%
 
672 14 1.2%
 
1040 12 1.0%
 
912 11 0.9%
 
768 10 0.8%
 
816 10 0.8%
 
780 10 0.8%
 
728 10 0.8%
 
894 9 0.8%
 
Other values (632) 1057 88.1%
 

Minimum 5 values

Value Count Frequency (%)  
0 32 2.7%
 
105 1 0.1%
 
190 1 0.1%
 
264 3 0.2%
 
270 1 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
2524 1 0.1%
 
3094 1 0.1%
 
3138 1 0.1%
 
3200 1 0.1%
 
3206 1 0.1%
 

Utilities
Categorical

Distinct count 2
Unique (%) 0.2%
Missing (%) 0.0%
Missing (n) 0
AllPub
1199
NoSeWa
 
1
Value Count Frequency (%)  
AllPub 1199 99.9%
 
NoSeWa 1 0.1%
 

WoodDeckSF
Numeric

Distinct count 247
Unique (%) 20.6%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 95.137
Minimum 0
Maximum 857
Zeros (%) 51.3%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 168
95-th percentile 335
Maximum 857
Range 857
Interquartile range 168

Descriptive statistics

Standard deviation 124.03
Coef of variation 1.3037
Kurtosis 2.6453
Mean 95.137
MAD 101.38
Skewness 1.4657
Sum 114164
Variance 15384
Memory size 9.5 KiB
Value Count Frequency (%)  
0 616 51.3%
 
192 34 2.8%
 
144 29 2.4%
 
100 29 2.4%
 
120 29 2.4%
 
168 22 1.8%
 
224 13 1.1%
 
140 13 1.1%
 
240 9 0.8%
 
208 9 0.8%
 
Other values (237) 397 33.1%
 

Minimum 5 values

Value Count Frequency (%)  
0 616 51.3%
 
12 2 0.2%
 
24 1 0.1%
 
26 2 0.2%
 
28 2 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
574 1 0.1%
 
576 1 0.1%
 
670 1 0.1%
 
728 1 0.1%
 
857 1 0.1%
 

YearBuilt
Numeric

Distinct count 108
Unique (%) 9.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 1971.4
Minimum 1875
Maximum 2010
Zeros (%) 0.0%

Quantile statistics

Minimum 1875
5-th percentile 1916
Q1 1954
Median 1973
Q3 2000
95-th percentile 2007
Maximum 2010
Range 135
Interquartile range 46

Descriptive statistics

Standard deviation 30.048
Coef of variation 0.015243
Kurtosis -0.42704
Mean 1971.4
MAD 24.973
Skewness -0.61578
Sum 2365621
Variance 902.91
Memory size 9.5 KiB
Value Count Frequency (%)  
2006 51 4.2%
 
2005 50 4.2%
 
2004 45 3.8%
 
2007 44 3.7%
 
2003 36 3.0%
 
1976 29 2.4%
 
1977 26 2.2%
 
1920 23 1.9%
 
1954 22 1.8%
 
1959 21 1.8%
 
Other values (98) 853 71.1%
 

Minimum 5 values

Value Count Frequency (%)  
1875 1 0.1%
 
1880 4 0.3%
 
1882 1 0.1%
 
1885 1 0.1%
 
1890 2 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
2006 51 4.2%
 
2007 44 3.7%
 
2008 16 1.3%
 
2009 17 1.4%
 
2010 1 0.1%
 

YearRemodAdd
Numeric

Distinct count 61
Unique (%) 5.1%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 1985
Minimum 1950
Maximum 2010
Zeros (%) 0.0%

Quantile statistics

Minimum 1950
5-th percentile 1950
Q1 1967
Median 1994
Q3 2004
95-th percentile 2007
Maximum 2010
Range 60
Interquartile range 37

Descriptive statistics

Standard deviation 20.527
Coef of variation 0.010341
Kurtosis -1.2465
Mean 1985
MAD 18.448
Skewness -0.52031
Sum 2381985
Variance 421.37
Memory size 9.5 KiB
Value Count Frequency (%)  
1950 140 11.7%
 
2006 80 6.7%
 
2007 58 4.8%
 
2005 56 4.7%
 
2004 54 4.5%
 
2000 44 3.7%
 
2003 41 3.4%
 
2002 41 3.4%
 
2008 31 2.6%
 
1996 31 2.6%
 
Other values (51) 624 52.0%
 

Minimum 5 values

Value Count Frequency (%)  
1950 140 11.7%
 
1951 3 0.2%
 
1952 5 0.4%
 
1953 10 0.8%
 
1954 12 1.0%
 

Maximum 5 values

Value Count Frequency (%)  
2006 80 6.7%
 
2007 58 4.8%
 
2008 31 2.6%
 
2009 20 1.7%
 
2010 6 0.5%
 

YrSold
Numeric

Distinct count 5
Unique (%) 0.4%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 2007.8
Minimum 2006
Maximum 2010
Zeros (%) 0.0%

Quantile statistics

Minimum 2006
5-th percentile 2006
Q1 2007
Median 2008
Q3 2009
95-th percentile 2010
Maximum 2010
Range 4
Interquartile range 2

Descriptive statistics

Standard deviation 1.319
Coef of variation 0.00065695
Kurtosis -1.181
Mean 2007.8
MAD 1.142
Skewness 0.10228
Sum 2409373
Variance 1.7398
Memory size 9.5 KiB
Value Count Frequency (%)  
2009 281 23.4%
 
2007 280 23.3%
 
2006 253 21.1%
 
2008 247 20.6%
 
2010 139 11.6%
 

Minimum 5 values

Value Count Frequency (%)  
2006 253 21.1%
 
2007 280 23.3%
 
2008 247 20.6%
 
2009 281 23.4%
 
2010 139 11.6%
 

Maximum 5 values

Value Count Frequency (%)  
2006 253 21.1%
 
2007 280 23.3%
 
2008 247 20.6%
 
2009 281 23.4%
 
2010 139 11.6%
 

Correlations

Sample

Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196.0 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Ex Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 NaN Attchd 2003.0 RFn 2 548 TA TA Y 0 61 0 0 0 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0.0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Ex Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976.0 RFn 2 460 TA TA Y 298 0 0 0 0 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162.0 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Ex Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001.0 RFn 2 608 TA TA Y 0 42 0 0 0 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0.0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA Gd Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998.0 Unf 3 642 TA TA Y 0 35 272 0 0 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350.0 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA Ex Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000.0 RFn 3 836 TA TA Y 192 84 0 0 0 0 NaN NaN NaN 0 12 2008 WD Normal 250000
We can see that we have some missing data,both for numerical and categorical data.
For categorical data,missing data refers to the fact that the house doesn't have the feature. For example, we have a lot of missing values in the feature PoolQC. This means that there are lots of houses that have a pool quality of "NaN",meaning they don't have a pool.
What we want to solve is the missing values for numerical features such as Lot Frontage.
But before we do so,let us first separate the features from the output and join both train and test dataframes so that we can preprocess their data together.

Data Cleaning and Feature Selection

In [6]:
#get the IDs
train_ID=train.Id
test_ID=test.Id

#get sales price
y=train.SalePrice
In [7]:
#drop the non-feature columns
train=train.drop(['Id','SalePrice'],axis=1)
test=test.drop('Id',axis=1)
We now append the test to the train set,so that we can preprocess their data together,but first we save a variable so that we can split the data then.
In [8]:
index=train.shape[0]
In [9]:
#combine both dataframes
combined=train.append(test)
Let us now deal with the missing values for numerical features.
In [10]:
#this is the number of elements for each feature
combined.describe().iloc[0,:]
Out[10]:
MSSubClass       1460.0
LotFrontage      1201.0
LotArea          1460.0
OverallQual      1460.0
OverallCond      1460.0
YearBuilt        1460.0
YearRemodAdd     1460.0
MasVnrArea       1452.0
BsmtFinSF1       1460.0
BsmtFinSF2       1460.0
BsmtUnfSF        1460.0
TotalBsmtSF      1460.0
1stFlrSF         1460.0
2ndFlrSF         1460.0
LowQualFinSF     1460.0
GrLivArea        1460.0
BsmtFullBath     1460.0
BsmtHalfBath     1460.0
FullBath         1460.0
HalfBath         1460.0
BedroomAbvGr     1460.0
KitchenAbvGr     1460.0
TotRmsAbvGrd     1460.0
Fireplaces       1460.0
GarageYrBlt      1379.0
GarageCars       1460.0
GarageArea       1460.0
WoodDeckSF       1460.0
OpenPorchSF      1460.0
EnclosedPorch    1460.0
3SsnPorch        1460.0
ScreenPorch      1460.0
PoolArea         1460.0
MiscVal          1460.0
MoSold           1460.0
YrSold           1460.0
Name: count, dtype: float64
We can see that we have missing values in the features: LotFrontage,MasVnrArea,and GarageYrBlt.
Let us check each feature separatly and figure out how to impute the missing values.
In [11]:
combined.LotFrontage.describe()
Out[11]:
count    1201.000000
mean       70.049958
std        24.284752
min        21.000000
25%        59.000000
50%        69.000000
75%        80.000000
max       313.000000
Name: LotFrontage, dtype: float64
We have a few undefined values in the columns. We are going to fill them by the mean of the column.
(filling by the mean is not the best approach,but since we are implementing a naive approach,we will fill by the mean as it is the one of the most common ways for imputing missing data)
In [12]:
combined.LotFrontage.fillna(combined.LotFrontage.mean(),inplace=True)
In [13]:
combined.LotFrontage.describe()
Out[13]:
count    1460.000000
mean       70.049958
std        22.024023
min        21.000000
25%        60.000000
50%        70.049958
75%        79.000000
max       313.000000
Name: LotFrontage, dtype: float64
Let's check for MasVnArea.
In [14]:
combined.MasVnrArea.describe()
Out[14]:
count    1452.000000
mean      103.685262
std       181.066207
min         0.000000
25%         0.000000
50%         0.000000
75%       166.000000
max      1600.000000
Name: MasVnrArea, dtype: float64
We also have a few missing values that we are going to fill with the mean of the column.
In [15]:
combined.MasVnrArea.fillna(combined.MasVnrArea.mean(),inplace=True)
Let's look at the last feature.
In [16]:
combined.GarageYrBlt.describe()
Out[16]:
count    1379.000000
mean     1978.506164
std        24.689725
min      1900.000000
25%      1961.000000
50%      1980.000000
75%      2002.000000
max      2010.000000
Name: GarageYrBlt, dtype: float64
We also have a few missing values that we are going to fill with the mean of the column.
In [17]:
combined.GarageYrBlt.fillna(combined.GarageYrBlt.mean(),inplace=True)
In [18]:
combined.describe().iloc[0,:]
Out[18]:
MSSubClass       1460.0
LotFrontage      1460.0
LotArea          1460.0
OverallQual      1460.0
OverallCond      1460.0
YearBuilt        1460.0
YearRemodAdd     1460.0
MasVnrArea       1460.0
BsmtFinSF1       1460.0
BsmtFinSF2       1460.0
BsmtUnfSF        1460.0
TotalBsmtSF      1460.0
1stFlrSF         1460.0
2ndFlrSF         1460.0
LowQualFinSF     1460.0
GrLivArea        1460.0
BsmtFullBath     1460.0
BsmtHalfBath     1460.0
FullBath         1460.0
HalfBath         1460.0
BedroomAbvGr     1460.0
KitchenAbvGr     1460.0
TotRmsAbvGrd     1460.0
Fireplaces       1460.0
GarageYrBlt      1460.0
GarageCars       1460.0
GarageArea       1460.0
WoodDeckSF       1460.0
OpenPorchSF      1460.0
EnclosedPorch    1460.0
3SsnPorch        1460.0
ScreenPorch      1460.0
PoolArea         1460.0
MiscVal          1460.0
MoSold           1460.0
YrSold           1460.0
Name: count, dtype: float64
We can see that we managed fix the issue of the missing values.
Let's now check the correlation between the features and the Sale Price feature.
In [19]:
combined['SalePrice']=y
In [20]:
combined.corr().SalePrice.sort_values(ascending=False)
Out[20]:
SalePrice        1.000000
OverallQual      0.654385
GrLivArea        0.590617
GarageCars       0.524004
GarageArea       0.513377
TotalBsmtSF      0.501647
1stFlrSF         0.490801
FullBath         0.461570
TotRmsAbvGrd     0.448564
YearBuilt        0.432041
YearRemodAdd     0.427852
MasVnrArea       0.403802
GarageYrBlt      0.386839
Fireplaces       0.372401
BsmtFinSF1       0.333616
LotFrontage      0.290158
2ndFlrSF         0.278702
WoodDeckSF       0.270388
OpenPorchSF      0.258385
HalfBath         0.234466
LotArea          0.233986
BsmtFullBath     0.198040
BsmtUnfSF        0.146941
BedroomAbvGr     0.141105
ScreenPorch      0.104810
PoolArea         0.046896
MoSold           0.041965
BsmtFinSF2       0.018506
3SsnPorch        0.012448
BsmtHalfBath    -0.014629
LowQualFinSF    -0.021420
YrSold          -0.036494
MiscVal         -0.050300
OverallCond     -0.056972
MSSubClass      -0.059659
EnclosedPorch   -0.116645
KitchenAbvGr    -0.129535
Name: SalePrice, dtype: float64
The features PoolArea, MoSold, BsmtFinSF2, 3SsnPorch, BsmtHalfBath, LowQualFinSF, YrSold,MiscVal, OverallCond, and MSSubClass have very low correlations,so we are going to drop them.
In [21]:
combined.drop(['PoolArea','MoSold','BsmtFinSF2','3SsnPorch','BsmtHalfBath','LowQualFinSF','YrSold','MiscVal','OverallCond','MSSubClass','SalePrice'],axis=1,inplace=True)
Now time for the categorical features. We are going to one-hot encode them,but before that we need to fill the missing values by "None".
In [22]:
combined.Alley.isna().sum()
Out[22]:
1369
In [23]:
combined.isna().sum().sort_values(ascending=False)
Out[23]:
PoolQC           1453
MiscFeature      1406
Alley            1369
Fence            1179
FireplaceQu       690
GarageFinish       81
GarageType         81
GarageCond         81
GarageQual         81
BsmtFinType2       38
BsmtExposure       38
BsmtQual           37
BsmtCond           37
BsmtFinType1       37
MasVnrType          8
Electrical          1
Exterior1st         0
Exterior2nd         0
RoofMatl            0
ExterCond           0
MasVnrArea          0
ExterQual           0
YearRemodAdd        0
Foundation          0
RoofStyle           0
SaleCondition       0
YearBuilt           0
OverallQual         0
HouseStyle          0
BldgType            0
                 ... 
LotFrontage         0
Condition1          0
TotalBsmtSF         0
BsmtFinSF1          0
KitchenQual         0
ScreenPorch         0
EnclosedPorch       0
OpenPorchSF         0
WoodDeckSF          0
PavedDrive          0
GarageArea          0
GarageCars          0
GarageYrBlt         0
Fireplaces          0
Functional          0
TotRmsAbvGrd        0
KitchenAbvGr        0
BsmtUnfSF           0
BedroomAbvGr        0
HalfBath            0
FullBath            0
BsmtFullBath        0
GrLivArea           0
2ndFlrSF            0
1stFlrSF            0
CentralAir          0
HeatingQC           0
Heating             0
SaleType            0
MSZoning            0
Length: 69, dtype: int64
In [24]:
combined.fillna("None",inplace=True)
In [25]:
#hot encode the data
combined=pd.get_dummies(combined)
Let's check if we have any missing values.
In [26]:
combined.isnull().sum().sort_values(ascending=False)
Out[26]:
SaleCondition_Partial    0
HouseStyle_SLvl          0
Condition2_PosN          0
Condition2_RRAe          0
Condition2_RRAn          0
Condition2_RRNn          0
BldgType_1Fam            0
BldgType_2fmCon          0
BldgType_Duplex          0
BldgType_Twnhs           0
BldgType_TwnhsE          0
HouseStyle_1.5Fin        0
HouseStyle_1.5Unf        0
HouseStyle_1Story        0
HouseStyle_2.5Fin        0
HouseStyle_2.5Unf        0
HouseStyle_2Story        0
Condition2_PosA          0
Condition2_Norm          0
Condition2_Feedr         0
Condition1_Feedr         0
Neighborhood_SawyerW     0
Neighborhood_Somerst     0
Neighborhood_StoneBr     0
Neighborhood_Timber      0
Neighborhood_Veenker     0
Condition1_Artery        0
Condition1_Norm          0
Condition2_Artery        0
Condition1_PosA          0
                        ..
Heating_OthW             0
Heating_Wall             0
HeatingQC_Ex             0
HeatingQC_Gd             0
BsmtFinType2_Unf         0
HeatingQC_Po             0
HeatingQC_TA             0
CentralAir_N             0
CentralAir_Y             0
Electrical_FuseA         0
Electrical_FuseF         0
Heating_Floor            0
BsmtFinType2_Rec         0
BsmtExposure_Gd          0
BsmtFinType1_LwQ         0
BsmtExposure_Mn          0
BsmtExposure_No          0
BsmtExposure_None        0
BsmtFinType1_ALQ         0
BsmtFinType1_BLQ         0
BsmtFinType1_GLQ         0
BsmtFinType1_None        0
BsmtFinType2_None        0
BsmtFinType1_Rec         0
BsmtFinType1_Unf         0
BsmtFinType2_ALQ         0
BsmtFinType2_BLQ         0
BsmtFinType2_GLQ         0
BsmtFinType2_LwQ         0
LotFrontage              0
Length: 293, dtype: int64
Let's now get the modified train and test sets.
In [27]:
mtrain=combined.iloc[:index,:]
In [28]:
mtest=combined.iloc[index:,:]

Testing Different Models and Comparing

We will train different models but we must first assess them.
We will apply cross fold validation in order to do so.
In [29]:
#this function takes an input as a model and prints the average rmse of k-fold cross validation
def model_score(model,mtrain,y,cv=50):
    #mse are by default negative in sklearn
    mse=-cross_val_score(model,mtrain,y,cv=50,scoring='neg_mean_squared_error')
    #take square root
    rmse=np.sqrt(mse)
    #average
    average_rmse=np.mean(rmse)
    print("The average RMSE is: %f." %rmse[0])
Now time to test some models.
We first initiate the model, then we fit the data to it, and finally we use model_score to assess the model.
In [30]:
#Linear Regressor
regressor=LinearRegression()
regressor.fit(mtrain,y)
model_score(regressor,mtrain,y)
The average RMSE is: 28940.975976.
In [31]:
#random forest
rf=RandomForestRegressor(n_estimators=100)
rf.fit(mtrain,y)
model_score(rf,mtrain,y)
The average RMSE is: 24086.086256.
In [32]:
#XGBoost
xgb=XGBRegressor()
xgb.fit(mtrain,y)
model_score(xgb,mtrain,y)
The average RMSE is: 22355.793436.
In [33]:
#SVR
svr=SVR(kernel='rbf')
svr.fit(mtrain,y)
model_score(svr,mtrain,y)
The average RMSE is: 72607.925589.
In [34]:
#we will the same models but we will apply PCA to the data
pca=PCA(n_components=100)
pca_mtrain=pca.fit_transform(mtrain)
In [35]:
regressor.fit(pca_mtrain,y)
model_score(regressor,pca_mtrain,y)
The average RMSE is: 28435.432337.
In [36]:
rf.fit(pca_mtrain,y)
model_score(rf,pca_mtrain,y)
The average RMSE is: 23866.225774.
In [37]:
xgb.fit(pca_mtrain,y)
model_score(xgb,pca_mtrain,y)
The average RMSE is: 25108.068674.
In [38]:
svr=SVR(kernel='rbf')
svr.fit(pca_mtrain,y)
model_score(svr,mtrain,y)
The average RMSE is: 72607.925589.

Remarks and Notes

  • Clearly this approach isn't the best as many things we did were not justified scientifically.
  • Even if you use the most sophisticated model, if you input garbage,then you will also output garbage.
  • A lot of work needs to be done on data cleaning and feature selection


All this will be tackled in the second approach.

Approach #2

The approach done before was a bit naive and lacks any statistical reasons behind it.
We fix this here in this approach and hope to get a lower rmse.
We first begin by reimporting the data again.
In [39]:
def import_data():
    train=pd.read_csv("Challenge Data/train.csv")
    test=pd.read_csv("Challenge Data/test.csv")
    return train,test


train,test=import_data()
Let's separate the IDs and Sale Price
In [40]:
#get IDs
train_ID=train.Id
test_ID=test.Id
In [41]:
#get SalePrices
y=train.SalePrice
In [42]:
train.shape
Out[42]:
(1200, 81)
In [43]:
#drop the unwanted columns
train.drop(['Id','SalePrice'],axis=1,inplace=True)
test.drop('Id',axis=1,inplace=True)
In [44]:
train.shape
Out[44]:
(1200, 79)
Let's see how much missing values we have.
In [45]:
#total number of rows of dataset
n_rows=train.shape[0]+test.shape[0]
In [46]:
#proportions of number of rows that have missing data to total number of rows 
test_miss=test.isna().sum()/n_rows
train_miss=train.isna().sum()/n_rows
In [51]:
plt.figure(figsize=(13,7))
plt.title("Percentage of missing values in training dataset")
train_miss[train_miss>0].sort_values(ascending=True).plot(kind="barh")
plt.xlabel("Percentage")
plt.show()
In [52]:
plt.figure(figsize=(13,7))
plt.title("Percentage of missing values in test dataset")
test_miss[test_miss>0].sort_values(ascending=True).plot(kind="barh")
plt.xlabel("Percentage")
plt.show()
In addition to the data visualisation done in Approach#1, we will explore the data more.
Let's start by plotting the Sale Price
In [55]:
plt.figure(figsize=(13,7))
sns.distplot(y)
plt.title("Distribution Plot of SalePrice")
Out[55]:
Text(0.5,1,'Distribution Plot of SalePrice')
We see here that the SalePrice is skewed and has a peak value. This mean more people tend to buy houses which are not expensive and a few people buy expensive houses.
Let's check its Kurtosis and Skewness.
To simply put it, a higher kurtosis means that the distribution has fat tails compared to a standard normal distribution.
A Positive skeweness means that the distribution is positively skewed,i.e., the tail is located to the right.
In [56]:
print("Kurtosis: %f" %y.kurt())
print("Skewness: %f" %y.skew())
Kurtosis: 7.033907
Skewness: 1.967215
Apparently, many people tend to take the log of right skewed depenedent variables and work with it and then use the exponentiation function to get the result.
We will do so and compare both ways to see which is better.
In [57]:
plt.figure(figsize=(13,7))
log_y=np.log(y)
sns.distplot(log_y)
plt.show()
print("Kurtosis: %f" %log_y.kurt())
print("Skewness: %f" %log_y.skew())
Kurtosis: 0.888850
Skewness: 0.132714
We can see that the data has become more uniform.
We will now study numerical and categorical features.
Let's write a function that extracts the numerical and categorical features.
In [58]:
def get_features(train):
    #select all numerical feature names
    numerical_features=train.select_dtypes(include=np.number).columns

    #select all categorical feature names
    categorical_features=train.select_dtypes(exclude=np.number).columns
    
    return numerical_features,categorical_features
In [59]:
numerical_features,categorical_features=get_features(train)
We begin with numerical features. We will study each one by plotting distribution of each and a scatter plot against the SalePrice.
Let's begin with the distribution plot.
In [60]:
train.hist(figsize=(25,20))
plt.show()
A few remarks:
  • We have some features such as Garage Cars that seem to have a certain dominant value over the other values. A suggestion here might be to replace this dominant values by a "1" and all others by a "0".
  • It might be a good idea to sum all features related to a house area into one feature that denotes the total house area.
  • The YrSold variable seems to have nearly the same number of houses sold in all years except for 2010. That was probably due to the financial crisis that hit the world. We might encode it as a cat egorical variable with a value of "1" for years where sold houses are in 2010 and 0 for years which houses were not sold in 2010.
  • The GarageYrBuilt and the YearBuilt features are somehow similar. Usually,one builds the house along with the garage. It seems reasonable here to drop one of these variables as they are redundant.
  • The MiscVal feature seems useless as most of the data is 0 and it would be a good idea to drop the feature.
  • A lot of houses don't seem to have pools
  • It might be a good idea to sum features corresponding to the number of rooms,number of kitchens,number of bedrooms,and number of baths into one feature

We will tackle these in the Feature Engineering and Selection part.

Now let's plot scatter plots between each numerical features and the SalePrice variable.
As garage cars increase,garage area increases.
In [61]:
#dictionary contraing features name as keys and correlation with sale price as value
corr_dict=train[numerical_features].corrwith(y).sort_values(ascending=False).to_dict()
In [62]:
#graphs are sorted by alphabetical order of the feature names
i=1
plt.figure(figsize=(55, 60))
for k in sorted(corr_dict):
    plt.subplot(6,6,i)
    plt.scatter(train[k],y)
    plt.title("Correlation: %f" %corr_dict[k],fontsize=30)
    plt.xlabel(k,fontsize=25)
    plt.ylabel("Sale Prices",fontsize=25)
    i+=1
A few remarks:
  • There are some outliers in features such as GrLivArea and TotalBsmtSF. We should consider removing them.
  • There are features such as BsmtFinSF2 that have near zero correlation. In the previous approach, we removed such low corrleations. In this approach, we will keep a few of them and try to generate some useful features that might help us.
To finish visualisation of numerical features,let us plot a heat map to see correlations between the features.
In [63]:
#tryong colors for heat map
sns.palplot(sns.color_palette("RdBu_r", 6))
In [79]:
plt.subplots(figsize=(15,10))
ax=sns.heatmap(train.corr(),cmap=sns.color_palette("RdBu_r", 5),linewidths=.5,vmin=-1,vmax=1)
cbar = ax.collections[0].colorbar
cbar.set_ticks(np.arange(-1,1.1,0.2))
plt.title("Heatmap of Numerical Features")
plt.show()
We see that we have a lot of correlations between -0.2 and 0.2.
There also exists some nearly perfect corrleations between some features.
We will now study categorical features and then try to find best features to match our data.
We now study categorical features.
In [80]:
i=1
plt.figure(figsize=(80, 350))
for cat in sorted(categorical_features):
    plt.subplot(22,2,i)
    
    #this is just for the xlabels to not overlap
    if(cat=="Neighborhood"):
        plt.xticks(rotation=30)
        
    
    #stripplot generalizes scatter to categorical variables
    sns.stripplot(train[cat],y)
    plt.xlabel(cat,fontsize=30)
    plt.ylabel("Sale Prices",fontsize=30)
    plt.xticks(fontsize=27)
    plt.yticks(fontsize=27)
    i+=1
  • Obviously,we have missing values by looking at the feature PoolQC. We will deal with those in the next section.
  • Most of the data in the feature Utilities is dominated by a single value and we could consider removing it. Usually,people buy a house equiped with all the utilities.(Electricity, Water, Gas, and Sewer)

We will now impute the missing data.

Data Cleaning

Let's see how much missing values we have for numerical features in the train and test dataset.
In [81]:
miss_train=train.isna().sum().sort_values(ascending=False)
miss_train[miss_train>0]
Out[81]:
PoolQC          1196
MiscFeature     1153
Alley           1125
Fence            973
FireplaceQu      564
LotFrontage      210
GarageType        67
GarageFinish      67
GarageQual        67
GarageCond        67
GarageYrBlt       67
BsmtExposure      33
BsmtFinType2      33
BsmtCond          32
BsmtFinType1      32
BsmtQual          32
MasVnrArea         6
MasVnrType         6
dtype: int64
In [82]:
miss_test=test.isna().sum().sort_values(ascending=False)
miss_test[miss_test>0]
Out[82]:
PoolQC          257
MiscFeature     253
Alley           244
Fence           206
FireplaceQu     126
LotFrontage      49
GarageCond       14
GarageType       14
GarageYrBlt      14
GarageFinish     14
GarageQual       14
BsmtFinType1      5
BsmtExposure      5
BsmtCond          5
BsmtQual          5
BsmtFinType2      5
MasVnrArea        2
MasVnrType        2
Electrical        1
dtype: int64
Let's combine the train and test set so that we can clean thier data together.
In [83]:
#used to slice test from train
index=train.shape[0]
In [84]:
index
Out[84]:
1200
In [85]:
full=pd.concat([train,test],ignore_index=True)
Let's start with numerical features.
In [86]:
miss_full=full[numerical_features].isna().sum().sort_values(ascending=False)
miss_full=miss_full[miss_full>0]
miss_full
Out[86]:
LotFrontage    259
GarageYrBlt     81
MasVnrArea       8
dtype: int64
In [87]:
miss_full_cols=miss_full.index
In [185]:
pandas_profiling.ProfileReport(full[miss_full_cols])
Out[185]:

Overview

Dataset info

Number of variables 3
Number of observations 1460
Total Missing (%) 7.9%
Total size in memory 34.3 KiB
Average record size in memory 24.1 B

Variables types

Numeric 3
Categorical 0
Boolean 0
Date 0
Text (Unique) 0
Rejected 0
Unsupported 0

Warnings

  • GarageYrBlt has 81 / 5.5% missing values Missing
  • LotFrontage has 259 / 17.7% missing values Missing
  • MasVnrArea has 861 / 59.0% zeros Zeros
  • Dataset has 281 duplicate rows Warning

Variables

GarageYrBlt
Numeric

Distinct count 98
Unique (%) 6.7%
Missing (%) 5.5%
Missing (n) 81
Infinite (%) 0.0%
Infinite (n) 0
Mean 1978.5
Minimum 1900
Maximum 2010
Zeros (%) 0.0%

Quantile statistics

Minimum 1900
5-th percentile 1930
Q1 1961
Median 1980
Q3 2002
95-th percentile 2007
Maximum 2010
Range 110
Interquartile range 41

Descriptive statistics

Standard deviation 24.69
Coef of variation 0.012479
Kurtosis -0.41834
Mean 1978.5
MAD 20.913
Skewness -0.64941
Sum 2728400
Variance 609.58
Memory size 11.5 KiB
Value Count Frequency (%)  
2005.0 65 4.5%
 
2006.0 59 4.0%
 
2004.0 53 3.6%
 
2003.0 50 3.4%
 
2007.0 49 3.4%
 
1977.0 35 2.4%
 
1998.0 31 2.1%
 
1999.0 30 2.1%
 
1976.0 29 2.0%
 
2008.0 29 2.0%
 
Other values (87) 949 65.0%
 
(Missing) 81 5.5%
 

Minimum 5 values

Value Count Frequency (%)  
1900.0 1 0.1%
 
1906.0 1 0.1%
 
1908.0 1 0.1%
 
1910.0 3 0.2%
 
1914.0 2 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
2006.0 59 4.0%
 
2007.0 49 3.4%
 
2008.0 29 2.0%
 
2009.0 21 1.4%
 
2010.0 3 0.2%
 

LotFrontage
Numeric

Distinct count 111
Unique (%) 7.6%
Missing (%) 17.7%
Missing (n) 259
Infinite (%) 0.0%
Infinite (n) 0
Mean 70.05
Minimum 21
Maximum 313
Zeros (%) 0.0%

Quantile statistics

Minimum 21
5-th percentile 34
Q1 59
Median 69
Q3 80
95-th percentile 107
Maximum 313
Range 292
Interquartile range 21

Descriptive statistics

Standard deviation 24.285
Coef of variation 0.34668
Kurtosis 17.453
Mean 70.05
MAD 16.762
Skewness 2.1636
Sum 84130
Variance 589.75
Memory size 11.5 KiB
Value Count Frequency (%)  
60.0 143 9.8%
 
70.0 70 4.8%
 
80.0 69 4.7%
 
50.0 57 3.9%
 
75.0 53 3.6%
 
65.0 44 3.0%
 
85.0 40 2.7%
 
78.0 25 1.7%
 
21.0 23 1.6%
 
90.0 23 1.6%
 
Other values (100) 654 44.8%
 
(Missing) 259 17.7%
 

Minimum 5 values

Value Count Frequency (%)  
21.0 23 1.6%
 
24.0 19 1.3%
 
30.0 6 0.4%
 
32.0 5 0.3%
 
33.0 1 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
160.0 1 0.1%
 
168.0 1 0.1%
 
174.0 2 0.1%
 
182.0 1 0.1%
 
313.0 2 0.1%
 

MasVnrArea
Numeric

Distinct count 328
Unique (%) 22.5%
Missing (%) 0.5%
Missing (n) 8
Infinite (%) 0.0%
Infinite (n) 0
Mean 103.69
Minimum 0
Maximum 1600
Zeros (%) 59.0%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 166
95-th percentile 456
Maximum 1600
Range 1600
Interquartile range 166

Descriptive statistics

Standard deviation 181.07
Coef of variation 1.7463
Kurtosis 10.082
Mean 103.69
MAD 129.78
Skewness 2.6691
Sum 150550
Variance 32785
Memory size 11.5 KiB
Value Count Frequency (%)  
0.0 861 59.0%
 
72.0 8 0.5%
 
180.0 8 0.5%
 
108.0 8 0.5%
 
120.0 7 0.5%
 
16.0 7 0.5%
 
106.0 6 0.4%
 
80.0 6 0.4%
 
340.0 6 0.4%
 
200.0 6 0.4%
 
Other values (317) 529 36.2%
 
(Missing) 8 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 861 59.0%
 
1.0 2 0.1%
 
11.0 1 0.1%
 
14.0 1 0.1%
 
16.0 7 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
1115.0 1 0.1%
 
1129.0 1 0.1%
 
1170.0 1 0.1%
 
1378.0 1 0.1%
 
1600.0 1 0.1%
 

Correlations

Sample

LotFrontage GarageYrBlt MasVnrArea
0 65.0 2003.0 196.0
1 80.0 1976.0 0.0
2 68.0 2001.0 162.0
3 60.0 1998.0 0.0
4 84.0 2000.0 350.0
For the feature GarageYrBlt, we have some missing values. What we will do now is fill the missing values by zero since this means that these houses have no garage. We will experiment later on with other approaches such as removing the rows or filling them with the mode.
But before that,let's look more at the GarageYrBlt feature. We have that the minimum value for the feature "GarageYrBuilt" is 1900 whereas the minimum value for "YearBuilt" is 1872. That seems weird. Investigating more in the feature we observe that:
  • We have some rows where the garage was built many years after the house was built. This seems fine as the house might have been passed along many generations and one of those generations decided to buy a car and thus build a garage.
  • We have some rows where the garage was built before the house was built. Although this might seem strange,but if we started building a house and a garage,the garage clearly will be finished earlier than the house, so the year where the garage was finished must be less than that when the house was finished.

In [88]:
full[full.YearBuilt==1880][['YearBuilt','GarageYrBlt']]
Out[88]:
YearBuilt GarageYrBlt
304 1880 2003.0
630 1880 1937.0
747 1880 1950.0
1132 1880 1930.0
In [89]:
full[full.GarageYrBlt<full.YearBuilt][['YearBuilt','GarageYrBlt']]
Out[89]:
YearBuilt GarageYrBlt
29 1927 1920.0
93 1910 1900.0
324 1967 1961.0
600 2005 2003.0
736 1950 1949.0
1103 1959 1954.0
1376 1930 1925.0
1414 1923 1922.0
1418 1963 1962.0
In [104]:
plt.figure(figsize=(13,9))
full.GarageYrBlt.hist()
plt.title("Distribution Plot of GarageYrBlt")
plt.show()
In [93]:
#fill missing values by 0
full.GarageYrBlt.fillna(0,inplace=True)
For the feature LotFrontage,we are going to fill the missing values with the mode.
In [94]:
full.LotFrontage.fillna(full.LotFrontage.mode()[0],inplace=True)
For the feature MasVnrArea, we will fill the missing values by zero as it is the most common element.
In [95]:
full.MasVnrArea.fillna(0,inplace=True)
Before tackling categorical features,let us check that we don't have any missing values.
In [96]:
full[numerical_features].isnull().sum().sort_values(ascending=False)>0
Out[96]:
YrSold           False
MoSold           False
GrLivArea        False
LowQualFinSF     False
2ndFlrSF         False
1stFlrSF         False
TotalBsmtSF      False
BsmtUnfSF        False
BsmtFinSF2       False
BsmtFinSF1       False
MasVnrArea       False
YearRemodAdd     False
YearBuilt        False
OverallCond      False
OverallQual      False
LotArea          False
LotFrontage      False
BsmtFullBath     False
BsmtHalfBath     False
FullBath         False
WoodDeckSF       False
MiscVal          False
PoolArea         False
ScreenPorch      False
3SsnPorch        False
EnclosedPorch    False
OpenPorchSF      False
GarageArea       False
HalfBath         False
GarageCars       False
GarageYrBlt      False
Fireplaces       False
TotRmsAbvGrd     False
KitchenAbvGr     False
BedroomAbvGr     False
MSSubClass       False
dtype: bool
Now it's time for the categorical variables. What we will do now is to replace every "nan" with "None" because when applying dummy enconding "None" is considered as a column while "nan" is not.
In [97]:
#number of missing values for categorical features
miss_full_cat=full[categorical_features].isna().sum().sort_values(ascending=False)
miss_full_cat=miss_full_cat[miss_full_cat>0]
miss_full_cat
Out[97]:
PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
FireplaceQu      690
GarageCond        81
GarageQual        81
GarageFinish      81
GarageType        81
BsmtFinType2      38
BsmtExposure      38
BsmtFinType1      37
BsmtQual          37
BsmtCond          37
MasVnrType         8
Electrical         1
dtype: int64
Consulting the "Data description" file given,most categorical features that contain missing values correspond to not having that feature. For example,missing values for the feature "PoolQC" means that the house doesn't contain a pool.
However,there is only one categorical feature that contains "nan" values and it shouldn't which is the "Electrical" feature
Let's deal with it.
In [98]:
full["Electrical"].describe()
Out[98]:
count      1459
unique        5
top       SBrkr
freq       1334
Name: Electrical, dtype: object
In [99]:
miss_full_cat['Electrical']
Out[99]:
1
In [100]:
print("Unique Values for the feature Electrical:", full["Electrical"].unique())
print("Number of missing values for the feature Electrical:", miss_full_cat["Electrical"])
print("\n")
Unique Values for the feature Electrical: ['SBrkr' 'FuseF' 'FuseA' 'FuseP' 'Mix' nan]
Number of missing values for the feature Electrical: 1


In [105]:
plt.figure(figsize=(13,9))
full.Electrical.hist()
plt.title("Distribution plot of Electrical")
plt.show()
In [106]:
full[full.Electrical.isnull()>0]
Out[106]:
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
1379 80 RL 73.0 9735 Pave NaN Reg Lvl AllPub Inside ... 0 0 NaN NaN NaN 0 5 2008 WD Normal

1 rows × 79 columns

Since the number of missing data is small,it doesn't hurt fill it with the most common element "SBrkr".
In [107]:
full.Electrical.fillna('SBrkr',inplace=True)
Now having finished all this we can proceed for feature engineering.

Feature Engineering and Selection

Let's start with the numerical features.
Let's replace the features that have to do with house area with one feature corresponding to the totalSF.
Let's have a look at the the features related to basement area.
We have the features BsmtFinSF1,BsmtFinSF2, BsmtUnfSF, and TotalBsmtSF.
It wouldn't be a surprise if the sum of the first three features is equal to data in TotalBsmtSF.
Let's see!
In [108]:
#sum all basement area
bsmtsum=full.BsmtFinSF1+full.BsmtFinSF2+full.BsmtUnfSF
In [109]:
#compare to total sum
bsmtsum.equals(full.TotalBsmtSF)
Out[109]:
True
This means that we can drop the 3 features and keep only the TotalBsmtSF feature!
In [110]:
full.shape
Out[110]:
(1460, 79)
In [111]:
full.drop('BsmtFinSF1',axis=1,inplace=True)
full.drop('BsmtFinSF2',axis=1,inplace=True)
full.drop('BsmtUnfSF',axis=1,inplace=True)
Now let's look at the area of the house. The feature GrLivArea is the area above ground.
Let's try to sum some features and see.
In [112]:
(full['1stFlrSF']+full['2ndFlrSF']+full.LowQualFinSF).equals(full.GrLivArea)
Out[112]:
True
It seems that the sum of 1stFlrSF, 2ndFlrSF, and LowQualFinSF equals the GrLivArea feature.
We can keep only the GrLivArea and drop the others.
In [113]:
full.drop('LowQualFinSF',axis=1,inplace=True)
full.drop('1stFlrSF',axis=1,inplace=True)
full.drop('2ndFlrSF',axis=1,inplace=True)
Let's look at the porches.
In [114]:
(full['3SsnPorch']+full.EnclosedPorch+full.OpenPorchSF+full.ScreenPorch).iloc[:index].corr(y)
Out[114]:
0.19782953308159243
We still have these raw features left: 3SsnPorch,EnclosedPorch,GarageArea,LotArea,MasVnrArea,OpenPorchSF,PoolArea,ScreenPorch,WoodDeckSF. We will leave them for now.
Let's drop useless features such MiscVal.
In [115]:
full.drop('MiscVal',axis=1,inplace=True)
Now off to categorical features.
Let's plot histograms of each feature.
In [119]:
i=1
plt.figure(figsize=(50, 250))
for cat in sorted(categorical_features):
    plt.subplot(22,2,i)

    
    #this is just for the xlabels to not overlap
    if(cat in ["Neighborhood","Exterior1st","Exterior2nd"]):
        plt.xticks(rotation=30)    
    

    full[cat].hist()
    plt.xticks(fontsize=20)
    plt.yticks(fontsize=20)
    plt.title("{}".format(cat),size=20)

    #plt.show()
    i+=1
Exterior1st looks much like Exterior2nd so it is safe to remove one of them.
In [120]:
full.drop('Exterior1st',axis=1,inplace=True)
We can remove Utilities because nearly all values are the same and that doens't help us much.
In [121]:
full.drop('Utilities',axis=1,inplace=True)
In [122]:
full.isnull().sum().sort_values(ascending=False)
Out[122]:
PoolQC           1453
MiscFeature      1406
Alley            1369
Fence            1179
FireplaceQu       690
GarageType         81
GarageCond         81
GarageFinish       81
GarageQual         81
BsmtFinType2       38
BsmtExposure       38
BsmtCond           37
BsmtQual           37
BsmtFinType1       37
MasVnrType          8
LandSlope           0
LotConfig           0
MSZoning            0
LotFrontage         0
LotArea             0
Foundation          0
ExterCond           0
ExterQual           0
MasVnrArea          0
Exterior2nd         0
RoofMatl            0
RoofStyle           0
YearRemodAdd        0
YearBuilt           0
OverallCond         0
                 ... 
HeatingQC           0
TotalBsmtSF         0
Fireplaces          0
YrSold              0
MoSold              0
PoolArea            0
ScreenPorch         0
3SsnPorch           0
EnclosedPorch       0
OpenPorchSF         0
WoodDeckSF          0
PavedDrive          0
GarageArea          0
GarageCars          0
GarageYrBlt         0
Functional          0
Heating             0
TotRmsAbvGrd        0
KitchenQual         0
KitchenAbvGr        0
BedroomAbvGr        0
HalfBath            0
FullBath            0
BsmtHalfBath        0
BsmtFullBath        0
GrLivArea           0
Electrical          0
CentralAir          0
SaleType            0
MSSubClass          0
Length: 70, dtype: int64
Now let's fill the missing values by "None".
In [123]:
full.fillna("None",inplace=True)
What we will do now is hot encode the data. We will try with removing one columns of each dummy variable to avoid the dummy trap just for experimenting
In [124]:
full_wd=full.copy() #wd=with droping first dummy variable column
full_wod=full.copy()#wod=without dropping first dummy variable column
In [125]:
#dummy encode
full_wd=pd.get_dummies(full_wd)
full_wod=pd.get_dummies(full_wod,drop_first=True)
In [126]:
#split to train and test
train_wd=full_wd.iloc[:index,:]
test_wd=full_wd.iloc[index:,:]
In [127]:
#split to train and test
train_wod=full_wod.iloc[:index,:]
test_wod=full_wod.iloc[index:,:]

Training Time

In [128]:
from sklearn.cross_validation import cross_val_score
def model_score(model,mtrain,y,cv=50):
    #mse are by default negative in sklearn
    mse=-cross_val_score(model,mtrain,y,cv=cv,scoring='neg_mean_squared_error')
    rmse=np.sqrt(mse)
    average_rmse=np.mean(rmse)
    std=np.std(rmse)
    #print(rmse)
    print("The average RMSE is: %f with a standard deviation of %f." %(average_rmse,std))
We will try several models.
In [129]:
#linear regressor: train_wd+log_y
lm1=LinearRegression()
lm1.fit(train_wd,log_y)
model_score(lm1,train_wd,log_y)
The average RMSE is: 0.137537 with a standard deviation of 0.057180.
In [130]:
#linear regressor: train_wod+log_y
lm2=LinearRegression()
lm2.fit(train_wod,log_y)
model_score(lm2,train_wod,log_y)
The average RMSE is: 0.133158 with a standard deviation of 0.051099.
Logging output variable and removing dummy variable seems to work best.
In [131]:
#random forest: train_wod+log_y
rlf=RandomForestRegressor()
rlf.fit(train_wod,log_y)
model_score(rlf,train_wod,log_y)
The average RMSE is: 0.149585 with a standard deviation of 0.036118.
In [132]:
#random forest: train_wd+log_y
rlf2=RandomForestRegressor()
rlf2.fit(train_wd,log_y)
model_score(rlf2,train_wd,log_y)
The average RMSE is: 0.149487 with a standard deviation of 0.038988.
We will now use XGBoost which is a kind of boosted trees.
In [133]:
#XGBoost: train_wod+log_y
xgb=XGBRegressor()
xgb.fit(train_wod,log_y)
model_score(xgb,train_wod,log_y)
The average RMSE is: 0.129749 with a standard deviation of 0.038742.
In [134]:
#XGBoost: train_wdd+log_y
xgb=XGBRegressor()
xgb.fit(train_wd,log_y)
model_score(xgb,train_wd,log_y)
The average RMSE is: 0.129172 with a standard deviation of 0.038268.
Our approach this time produced better results.
However we seek an even less RMSE.
Aside from the models,the decisive factor for a lower RMSE is choosing the right features and we didn't do a logical plan to choose them in this approach.We fix this in the following approach.

Approach #3

Let us first execute the essential starting code.
In [135]:
train,test=import_data()

#get IDs
train_ID=train.Id
test_ID=test.Id

#get SalePrices
y=train.SalePrice

#drop the unwanted columns
train.drop(['Id','SalePrice'],axis=1,inplace=True)
test.drop('Id',axis=1,inplace=True)

#log the target
log_y=np.log(y)

numerical_features,categorical_features=get_features(train)

#combine train and test data
full=pd.concat([train,test],ignore_index=True)
We did some data visualisation in previous approaches and thus won't do it here.
We will directly work with data cleaning and feature selection. Let's look again at the missing values for numerical and categorcal features.
Let's start with numerical features.
In [136]:
num_miss=full[numerical_features].isna().sum().sort_values(ascending=False)
num_miss[num_miss>0]
Out[136]:
LotFrontage    259
GarageYrBlt     81
MasVnrArea       8
dtype: int64
We have only 3 numerical features that have missing values.
Let's look at each one separately.
LotFrontage is the linear feet of street connected to property.
Usually, houses in the same neighborhood have their Lot Frontage values very close as seen below. So we will compute the average LotFrontage for each neighborhood and fill the missing values by the average.
For example,the average Lot Frontage for the Neighborhood Blmngtn is 47.142857,so we will fill each row that has Blmngtn as a Neighborhood and has a missing value for Lot Frontage by this average. Ofcourse,we are going to round this value as all values in LotFrontage are integers.
In [137]:
df=full[['LotFrontage','Neighborhood']]
In [138]:
neighb=df.Neighborhood.unique()
In [139]:
neighb
Out[139]:
array(['CollgCr', 'Veenker', 'Crawfor', 'NoRidge', 'Mitchel', 'Somerst',
       'NWAmes', 'OldTown', 'BrkSide', 'Sawyer', 'NridgHt', 'NAmes',
       'SawyerW', 'IDOTRR', 'MeadowV', 'Edwards', 'Timber', 'Gilbert',
       'StoneBr', 'ClearCr', 'NPkVill', 'Blmngtn', 'BrDale', 'SWISU',
       'Blueste'], dtype=object)
In [140]:
for n in neighb:
    df[df.Neighborhood==n].plot.bar(figsize=(10,5),xticks=None,color = "skyblue")
    plt.tick_params(axis='x',which='both',bottom='off',top='off',labelbottom='off') 
    plt.title("Lot Frontages for %s" %n)
    plt.axhline(y=df[df.Neighborhood==n].mean()[0], color='r', linestyle='--',label='Mean')
    plt.ylabel("Lot Frontage")
    plt.legend()
As we can see, usually most of the houses have their Lot Frontage close to the mean,so it seems a good value to fill the missing value with.
Let's now fill the missing values with the average of each neighborhood.
In [141]:
full.LotFrontage=full.groupby("Neighborhood")["LotFrontage"].transform(lambda x: x.fillna(round(x.mean())))
In [142]:
#check for missing values
full.LotFrontage.isna().sum()
Out[142]:
0
We now have to deal with the other 2 numerical variables.
Let's start with GarageYrBlt. These houses don't have garages so it seems logical that for these houses all features that have to do with garages are null.
Let's check.
In [143]:
garage_features=[n  for n in full.columns if 'Garage' in n]
garage_features
Out[143]:
['GarageType',
 'GarageYrBlt',
 'GarageFinish',
 'GarageCars',
 'GarageArea',
 'GarageQual',
 'GarageCond']
In [144]:
full[garage_features].isna().sum()
Out[144]:
GarageType      81
GarageYrBlt     81
GarageFinish    81
GarageCars       0
GarageArea       0
GarageQual      81
GarageCond      81
dtype: int64
All garage features have the same number of missing values except for Garage Cars and Garage Area.
Their values should sum to zero since if they don't have any garage than their garage cars and area must be zero. Also,their count should be 81.
Let us check.
In [145]:
full[full.GarageType.isna()].GarageCars.sum()
Out[145]:
0
In [146]:
full[full.GarageCars==0].shape[0]
Out[146]:
81
In [147]:
full[full.GarageType.isna()].GarageArea.sum()
Out[147]:
0
In [148]:
full[full.GarageArea==0].shape[0]
Out[148]:
81
Also, we need to check that the 81 missing values of the garage features correspond to the same entries.
In [149]:
full.GarageType.isna().equals(full.GarageYrBlt.isna())
Out[149]:
True
In [150]:
full.GarageType.isna().equals(full.GarageCond.isna())
Out[150]:
True
In [151]:
full.GarageType.isna().equals(full.GarageQual.isna())
Out[151]:
True
Now let us discuss a procedure for filling the GarageYrBlt feature.
These houses don't have a garage so it would be useless to specify a year where the garage would be built.
If we fill these missing values with zero, we observe a big decrease in the correlation with the target feature.
In [152]:
full.GarageYrBlt.corr(y)
Out[152]:
0.48979373237342116
In [153]:
full.GarageYrBlt.fillna(0).corr(y)
Out[153]:
0.2551230905554551
However,looking at the graphs in a previous approach and consulting the correlation matrix, we find a large correlation with the feature YearBuilt.
In [154]:
full.GarageYrBlt.corr(full.YearBuilt)
Out[154]:
0.825667484174342
Thus we are going to drop this feature.
In [155]:
full.drop('GarageYrBlt',axis=1,inplace=True)
We still have one numerical feature, MasVnrArea which is the masonry veneer area.
A missing value in this feature should mean that houses with these missing values don't have masonry veneers. Thus all features related to masonry veneer should be missing or have a value of zero.
Let's check.
In [156]:
mason_feat=[ind for ind in full.columns if 'Mas' in ind]
In [157]:
full[mason_feat].isna().sum()
Out[157]:
MasVnrType    8
MasVnrArea    8
dtype: int64
It would be logical to fill houses with no masonry veneer area with 0.
In [158]:
full.MasVnrArea.fillna(0,inplace=True)
Let's check if we still have any missing values for numerical features aside from the GarageYrBlt feature.
In [159]:
numerical_features,categorical_features=get_features(full)
In [160]:
full[numerical_features].isna().sum().sort_values(ascending=False)
Out[160]:
YrSold           0
BsmtFinSF1       0
LowQualFinSF     0
2ndFlrSF         0
1stFlrSF         0
TotalBsmtSF      0
BsmtUnfSF        0
BsmtFinSF2       0
MasVnrArea       0
BsmtFullBath     0
YearRemodAdd     0
YearBuilt        0
OverallCond      0
OverallQual      0
LotArea          0
LotFrontage      0
GrLivArea        0
BsmtHalfBath     0
MoSold           0
WoodDeckSF       0
MiscVal          0
PoolArea         0
ScreenPorch      0
3SsnPorch        0
EnclosedPorch    0
OpenPorchSF      0
GarageArea       0
FullBath         0
GarageCars       0
Fireplaces       0
TotRmsAbvGrd     0
KitchenAbvGr     0
BedroomAbvGr     0
HalfBath         0
MSSubClass       0
dtype: int64
Now it's time for categorical features.
In [161]:
full[categorical_features].isna().sum().sort_values(ascending=False)
Out[161]:
PoolQC           1453
MiscFeature      1406
Alley            1369
Fence            1179
FireplaceQu       690
GarageCond         81
GarageQual         81
GarageFinish       81
GarageType         81
BsmtFinType2       38
BsmtExposure       38
BsmtFinType1       37
BsmtQual           37
BsmtCond           37
MasVnrType          8
Electrical          1
Condition2          0
Condition1          0
Neighborhood        0
LandSlope           0
BldgType            0
LandContour         0
LotConfig           0
Utilities           0
RoofStyle           0
LotShape            0
Street              0
HouseStyle          0
SaleCondition       0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
ExterQual           0
ExterCond           0
Foundation          0
SaleType            0
Heating             0
HeatingQC           0
CentralAir          0
KitchenQual         0
Functional          0
PavedDrive          0
MSZoning            0
dtype: int64
Let's go through each feature beginning with PoolQC
Missing values in this feature means that the house has no pool,thus the pool area should be zero here.
Let's check.
In [162]:
full.PoolArea[full.PoolQC.isna()].sum()
Out[162]:
0
Also we shouldn't have a Pool Area of zero when the PoolQC is not null.
In [163]:
full[full.PoolArea==0].equals(full[full.PoolQC.isna()])
Out[163]:
True
We will now fill the missing values here with "No Pool".
In [164]:
full.PoolQC.fillna('No Pool',inplace=True)
Now let's see the feature MiscFeature.
In [165]:
plt.figure(figsize=(10,7))
full.MiscFeature.hist()
plt.xlabel("MiscFeature")
plt.ylabel("Count")
plt.show()
I will replace the missing values with "None".
In [166]:
full.MiscFeature.fillna("None",inplace=True)
In [167]:
plt.figure(figsize=(10,7))
full.MiscFeature.hist()
plt.xlabel("MiscFeature")
plt.ylabel("Count")
plt.show()
Now let's look at Alley.
In [169]:
plt.figure(figsize=(10,7))
full.Alley.hist()
plt.xlabel("Alley")
plt.ylabel("Count")
plt.show()
I will replace the missing value with "No Alley".
In [170]:
full.Alley.fillna("No Alley",inplace=True)
In [171]:
plt.figure(figsize=(10,7))
full.Alley.hist()
plt.xlabel("Alley")
plt.ylabel("Count")
plt.show()
The features Alley and MiscFeatures seem to have a lot of their data dominated by one value.
We might consider removing them later.
Now time for Fence.
We will fill the missing values by "No Fence".
In [172]:
plt.figure(figsize=(10,7))
full.Fence.fillna("No Fence",inplace=True)
full.Fence.hist()
plt.xlabel("Fence")
plt.ylabel("Count")
plt.show()
Now time for the feature FireplaceQu.
In [173]:
plt.figure(figsize=(10,7))
full.FireplaceQu.hist()
plt.xlabel("Fireplace Quality")
plt.ylabel("Count")
plt.show()
The houses with missing values for FireplaceQu should be identical to the rows of houses that have a zero in the feature Fireplaces.
In [174]:
full[full.FireplaceQu.isna()].equals(full[full.Fireplaces==0])
Out[174]:
True
We will fill the missing values with "None".
In [175]:
full.FireplaceQu.fillna("NA",inplace=True)
plt.figure(figsize=(10,7))
full.FireplaceQu.hist()
plt.xlabel("Fireplace Quality")
plt.ylabel("Count")
plt.show()
We can replace the values here by numbers as "Ex" is greater than "Po" in terms of fire place quality.
We will do that in the feature engineering part.
Now it's time for the garage variables.
We will fill the missing values with "No Garage".
In [176]:
full.GarageCond.fillna('No Garage',inplace=True)
full.GarageQual.fillna('No Garage',inplace=True)
full.GarageFinish.fillna('No Garage',inplace=True)
full.GarageType.fillna('No Garage',inplace=True)
In [177]:
#just to check
full[full.GarageArea==0].equals(full[full.GarageCond=='No Garage'])
Out[177]:
True
We can replace the values for the features GarageCond,GarageQual, and GarageFinish with numbers.
We will do so in the feature engineering part.
Now it's time for the basement variables.
We seem to have different numbers of missing values.
Let's check.
In [178]:
full.TotalBsmtSF.isna().sum()
Out[178]:
0
In [179]:
bsmt_features=[ind for ind in full.columns if 'Bsmt' in ind]
In [180]:
full[bsmt_features].isna().sum().sort_values(ascending=False)
Out[180]:
BsmtFinType2    38
BsmtExposure    38
BsmtFinType1    37
BsmtCond        37
BsmtQual        37
BsmtHalfBath     0
BsmtFullBath     0
TotalBsmtSF      0
BsmtUnfSF        0
BsmtFinSF2       0
BsmtFinSF1       0
dtype: int64
In [181]:
full[bsmt_features][full.BsmtFinType1.isna()].shape
Out[181]:
(37, 11)
In [182]:
full[bsmt_features][full.BsmtFinType2.isna()].shape
Out[182]:
(38, 11)
There is only one additional row.
Let's check it.
In [183]:
i=full[bsmt_features][full.BsmtFinType2.isna()].index.difference(full[bsmt_features][full.BsmtFinType1.isna()].index)[0]
i
Out[183]:
332
In [184]:
pd.DataFrame(full[bsmt_features].iloc[i,:])
Out[184]:
332
BsmtQual Gd
BsmtCond TA
BsmtExposure No
BsmtFinType1 GLQ
BsmtFinSF1 1124
BsmtFinType2 NaN
BsmtFinSF2 479
BsmtUnfSF 1603
TotalBsmtSF 3206
BsmtFullBath 1
BsmtHalfBath 0
Here,we have a missing value for BsmtFinType2. Since,this house has a value in the feature BsmtFinSF2,it should have a value in BsmtFinType2.
Let's look at the feature BsmtFinType2.
In [394]:
full.BsmtFinType2.hist()
plt.show()
Since most of the values of this feature are "Unf",we think hat it might be a good value.
However,we can't do this here as we only use "Unf" when the value of BsmtFinSF2 is zero and this is not the case here as we have finished area.
In [395]:
full[['BsmtFinSF2','BsmtFinType2']][full.BsmtFinType2=='Unf']
Out[395]:
BsmtFinSF2 BsmtFinType2
0 0 Unf
1 0 Unf
2 0 Unf
3 0 Unf
4 0 Unf
5 0 Unf
6 0 Unf
8 0 Unf
9 0 Unf
10 0 Unf
11 0 Unf
12 0 Unf
13 0 Unf
14 0 Unf
15 0 Unf
16 0 Unf
18 0 Unf
19 0 Unf
20 0 Unf
21 0 Unf
22 0 Unf
23 0 Unf
25 0 Unf
27 0 Unf
28 0 Unf
29 0 Unf
30 0 Unf
31 0 Unf
32 0 Unf
33 0 Unf
... ... ...
1425 0 Unf
1426 0 Unf
1427 0 Unf
1428 0 Unf
1429 0 Unf
1430 0 Unf
1431 0 Unf
1432 0 Unf
1433 0 Unf
1434 0 Unf
1435 0 Unf
1436 0 Unf
1437 0 Unf
1438 0 Unf
1440 0 Unf
1441 0 Unf
1442 0 Unf
1443 0 Unf
1444 0 Unf
1446 0 Unf
1447 0 Unf
1448 0 Unf
1449 0 Unf
1450 0 Unf
1451 0 Unf
1452 0 Unf
1453 0 Unf
1454 0 Unf
1455 0 Unf
1457 0 Unf

1256 rows × 2 columns

In [396]:
[(ind,full.BsmtFinType2[full.BsmtFinType2==ind].count()) for ind in full.BsmtFinType2.unique()]
Out[396]:
[('Unf', 1256),
 ('BLQ', 33),
 (nan, 0),
 ('ALQ', 19),
 ('Rec', 54),
 ('LwQ', 46),
 ('GLQ', 14)]
We see that the second most common value is Rec and so we will fill the missing value with this.
In [185]:
full.BsmtFinType2.iloc[i]='Rec'
Now we will see the additional missing value in BsmtExposure.
In [186]:
i=full[bsmt_features][full.BsmtExposure.isna()].index.difference(full[bsmt_features][full.BsmtFinType1.isna()].index)[0]
i
Out[186]:
948
In [187]:
pd.DataFrame(full[bsmt_features].iloc[i,:])
Out[187]:
948
BsmtQual Gd
BsmtCond TA
BsmtExposure NaN
BsmtFinType1 Unf
BsmtFinSF1 0
BsmtFinType2 Unf
BsmtFinSF2 0
BsmtUnfSF 936
TotalBsmtSF 936
BsmtFullBath 0
BsmtHalfBath 0
Since this house has a basement,we can't fill the missing value with "No Basement".
Let's try to figure a way.
In [400]:
full.BsmtExposure.hist()
plt.show()
Since the most common value is "No" and there is no direct relation BsmtExposure and other basement features, we will fill the missing value by "No".
In [188]:
full.BsmtExposure.iloc[i]='No'
Now we should have the same missing values for the basement features.
In [189]:
full[bsmt_features].isna().sum()
Out[189]:
BsmtQual        37
BsmtCond        37
BsmtExposure    37
BsmtFinType1    37
BsmtFinSF1       0
BsmtFinType2    37
BsmtFinSF2       0
BsmtUnfSF        0
TotalBsmtSF      0
BsmtFullBath     0
BsmtHalfBath     0
dtype: int64
In [190]:
#check if missing rows are the same for categorical features
full[full.BsmtQual.isna()].equals(full[full.BsmtFinType1.isna()])\
and full[full.BsmtQual.isna()].equals(full[full.BsmtFinType2.isna()])\
and full[full.BsmtQual.isna()].equals(full[full.BsmtCond.isna()]) \
and full[full.BsmtQual.isna()].equals(full[full.BsmtExposure.isna()])
Out[190]:
True
In [191]:
#check for numerical variable
full[full.BsmtQual.isna()].equals(full[full.TotalBsmtSF==0])
Out[191]:
True
Now we can fill the remaining values with "No Basement".
In [192]:
full.BsmtQual.fillna('No Basement',inplace=True)
full.BsmtCond.fillna('No Basement',inplace=True)
full.BsmtFinType1.fillna('No Basement',inplace=True)
full.BsmtExposure.fillna('No Basement',inplace=True)
full.BsmtFinType2.fillna('No Basement',inplace=True)
In [193]:
#check if we have any missing values for basement features
full[bsmt_features].isna().sum()
Out[193]:
BsmtQual        0
BsmtCond        0
BsmtExposure    0
BsmtFinType1    0
BsmtFinSF1      0
BsmtFinType2    0
BsmtFinSF2      0
BsmtUnfSF       0
TotalBsmtSF     0
BsmtFullBath    0
BsmtHalfBath    0
dtype: int64
Let's now look at the last two features.
For the feature MasVnrType,we can directly fill the missing value with "No Masonry Veneer".
In [194]:
full.MasVnrType.fillna("No Masonry Veneer",inplace=True)
For the Electrical feature, we will fill the missing value with the most common value.
In [408]:
full.Electrical.hist()
plt.xlabel("Electrical")
plt.ylabel("Count")
plt.show()
In [195]:
full.Electrical.fillna('SBrkr',inplace=True)
In [196]:
#check to see if we have any missing values
full.isna().sum().sort_values(ascending=False)
Out[196]:
SaleCondition    0
Foundation       0
RoofMatl         0
Exterior1st      0
Exterior2nd      0
MasVnrType       0
MasVnrArea       0
ExterQual        0
ExterCond        0
BsmtQual         0
YearRemodAdd     0
BsmtCond         0
BsmtExposure     0
BsmtFinType1     0
BsmtFinSF1       0
BsmtFinType2     0
BsmtFinSF2       0
BsmtUnfSF        0
RoofStyle        0
YearBuilt        0
SaleType         0
Utilities        0
MSZoning         0
LotFrontage      0
LotArea          0
Street           0
Alley            0
LotShape         0
LandContour      0
LotConfig        0
                ..
EnclosedPorch    0
ScreenPorch      0
CentralAir       0
PoolArea         0
PoolQC           0
Fence            0
MiscFeature      0
MiscVal          0
MoSold           0
YrSold           0
GarageCars       0
GarageFinish     0
GarageType       0
FireplaceQu      0
Electrical       0
1stFlrSF         0
2ndFlrSF         0
LowQualFinSF     0
GrLivArea        0
BsmtFullBath     0
BsmtHalfBath     0
FullBath         0
HalfBath         0
BedroomAbvGr     0
KitchenAbvGr     0
KitchenQual      0
TotRmsAbvGrd     0
Functional       0
Fireplaces       0
MSSubClass       0
Length: 78, dtype: int64
In [197]:
#we will keep a copy of the cleaned data for future use
full_cleaned=full.copy()
In [600]:
#full=full_cleaned.copy()

Feature Engineering

Let's begin with categorical features.
We will replace some categorcial features with numbers.
For example, let's look at the feature PoolQC.
In [198]:
full.PoolQC.unique()
Out[198]:
array(['No Pool', 'Ex', 'Fa', 'Gd'], dtype=object)
We can see that 'Ex' is greater than all other values in terms of quality.
So we can replace the values by numbers with higher numbers meaning higher quality.
In [199]:
dict1={'No Pool':0,'Fa':1,'Gd':2,'Ex':3}
In [200]:
full.PoolQC=full.PoolQC.apply(lambda x: dict1[x])
Let's look at MiscFeatures.
In [415]:
full.MiscFeature.hist()
plt.xlabel("MiscFeatures")
plt.ylabel("Count")
plt.show()
There seems to be a dominant value here.
Let's try replacing the dominant value "None" by a zero and the others by a "1" and we will see its correlation with the target variable.
If it is too low,we will delete it.
In [201]:
a=full.MiscFeature.apply(lambda x: 0 if x=='None'  else 1)
In [202]:
a.corr(y)
Out[202]:
-0.09204248012681689
In [203]:
delete=[] #list to store features to be deleted
delete.append("MiscFeature")
Let's look at Alley.
In [419]:
full.Alley.hist()
plt.xlabel("Alley")
plt.ylabel("Count")
plt.show()
There seems to be a dominant value here.
Let's try replacing the dominant value "None" by a zero and the others by a "1" and we will see its correlation with the target variable.
If it is too low,we will delete it.
In [204]:
a=full.Alley.apply(lambda x: 0 if x=='No Alley'  else 1)
In [205]:
a.corr(y)
Out[205]:
-0.12220790878535244
In [206]:
delete.append('Alley')
Let's look at Fence.
In [423]:
full.Fence.hist()
plt.xlabel("Fence")
plt.ylabel("Count")
plt.show()
We will dummy encode this feature later.
Let's look at the feature FireplaceQu.
In [424]:
full.FireplaceQu.hist()
plt.xlabel("Fireplace Quality")
plt.ylabel("Count")
plt.show()
We will replace these categories by numbers as we did before.
In [207]:
dict1={'NA':0,'Po':1,'TA':2,'Fa':3,'Gd':4,'Ex':5}
In [208]:
full.FireplaceQu=full.FireplaceQu.apply(lambda x: dict1[x])
Now let's look at the garage features and replace the necessary categories by numbers.
In [428]:
full.GarageCond.hist()
plt.xlabel("GarageCond")
plt.ylabel("Count")
plt.show()
In [209]:
#replace by numbers
dict1={'No Garage':0,'Po':1,'TA':2,'Fa':3,'Gd':4,'Ex':5}
In [210]:
full.GarageCond=full.GarageCond.apply(lambda x: dict1[x])
In [431]:
full.GarageQual.hist()
plt.xlabel("GarageQual")
plt.ylabel("Count")
plt.show()
In [211]:
full.GarageQual=full.GarageQual.apply(lambda x: dict1[x])
In [433]:
full.GarageFinish.hist()
plt.xlabel("GarageFinish")
plt.ylabel("Count")
plt.show()
In [212]:
#a finished garage cost more
dict1={'No Garage':0,'Unf':1,'RFn':2,'Fin':3}
In [213]:
full.GarageFinish=full.GarageFinish.apply(lambda x: dict1[x])
Now let's work with the basement features.
In [214]:
full.BsmtFinType2.unique()
Out[214]:
array(['Unf', 'BLQ', 'No Basement', 'ALQ', 'Rec', 'LwQ', 'GLQ'],
      dtype=object)
In [215]:
#higher basement rating means higher price
dict1={'No Basement':0, 'Unf':1, 'LwQ':2, 'Rec':3, 'BLQ':4, 'ALQ':5, 'GLQ':6}
In [216]:
full.BsmtFinType1=full.BsmtFinType1.apply(lambda x: dict1[x])
full.BsmtFinType2=full.BsmtFinType2.apply(lambda x: dict1[x])
In [217]:
full.BsmtExposure.unique()
Out[217]:
array(['No', 'Gd', 'Mn', 'Av', 'No Basement'], dtype=object)
In [218]:
dict1={'No Basement':0, 'No':1, 'Mn':2, 'Av':3, 'Gd':4}
In [219]:
full.BsmtExposure=full.BsmtExposure.apply(lambda x: dict1[x])
In [220]:
full.BsmtQual.unique()
Out[220]:
array(['Gd', 'TA', 'Ex', 'No Basement', 'Fa'], dtype=object)
In [221]:
dict1={'No Basement':0,'Po':1,'TA':2,'Fa':3,'Gd':4,'Ex':5}
In [222]:
full.BsmtQual=full.BsmtQual.apply(lambda x: dict1[x])
In [223]:
full.BsmtCond.unique()
Out[223]:
array(['TA', 'Gd', 'No Basement', 'Fa', 'Po'], dtype=object)
In [224]:
full.BsmtCond=full.BsmtCond.apply(lambda x: dict1[x])
Now let's work with MSZoning.
In [448]:
full.MSZoning.hist()
plt.xlabel("MSZoning")
plt.ylabel("Count")
plt.show()
We will dummy encode this feature later.
Now let's work with Street.
In [449]:
full.Street.hist()
plt.xlabel("Street")
plt.ylabel("Count")
plt.show()
We delete this feature later.
In [225]:
delete.append('Street')
Now let's work with LotShape.
In [452]:
full.LotShape.hist()
plt.xlabel("LotShape")
plt.ylabel("Count")
plt.show()
We will dummy encode this feature later.
Now let's work with LandContour.
In [453]:
full.LandContour.hist()
plt.xlabel("LandContour")
plt.ylabel("Count")
plt.show()
We will dummy encode this feature later.
Now let's work with Utilities.
In [454]:
full.Utilities.hist()
plt.xlabel("Utilities")
plt.ylabel("Count")
plt.show()
In [226]:
full[full.Utilities!="AllPub"]
Out[226]:
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
944 20 RL 80.0 14375 Pave No Alley IR1 Lvl NoSeWa CulDSac ... 233 0 0 No Fence None 0 1 2009 COD Abnorml

1 rows × 78 columns

There is only one entry which has a different value for the feature Utilities.
This feature seems useless.
We will delete this feature later.
In [227]:
delete.append("Utilities")
Now let's work with LotConfig.
In [457]:
full.LotConfig.hist()
plt.xlabel("LotConfig")
plt.ylabel("Count")
plt.show()
We will dummy encode this feature later.
Now let's work with LandSlope.
In [458]:
full.LandSlope.hist()
plt.xlabel("LandSlope")
plt.ylabel("Count")
plt.show()
We will delete this feature later.
In [228]:
delete.append("LandSlope")
Now let's work with Neighborhood.
In [460]:
full.Neighborhood.hist()
plt.xlabel("Neighborhood")
plt.ylabel("Count")
plt.xticks(rotation=90)
plt.show()
We will dummy encode this feature later.
Now let's work with Condition1.
In [461]:
full.Condition1.hist()
plt.xlabel("Condition1")
plt.ylabel("Count")
plt.show()
We will delete this feature later.
In [229]:
delete.append("Condition1")
Now let's work with Condition2.
In [463]:
full.Condition2.hist()
plt.xlabel("Condition2")
plt.ylabel("Count")
plt.show()
We will delete this feature later.
In [230]:
delete.append("Condition2")
Now let's work with BldgType.
In [465]:
full.BldgType.hist()
plt.xlabel("BldgType")
plt.ylabel("Count")
plt.show()
We will dummy encode this feature later.
Now let's work with HouseStyle.
In [466]:
full.HouseStyle.hist()
plt.xlabel("HouseStyle")
plt.ylabel("Count")
plt.show()
We will dummy encode this feature later.
Now let's work with RoofStyle.
In [467]:
full.RoofStyle.hist()
plt.xlabel("RoofStyle")
plt.ylabel("Count")
plt.show()
We will dummy encode this feature later.
Now let's work with RoofMatl.
In [468]:
full.RoofMatl.hist()
plt.xlabel("RoofMatl")
plt.ylabel("Count")
plt.show()
We will delete this feature later.
In [231]:
delete.append("RoofMatl")
Now let's work with Exterior1st.
In [470]:
full.Exterior1st.hist()
plt.xlabel("Exterior1st")
plt.ylabel("Count")
plt.xticks(rotation=90)
plt.show()
We will dummy encode this feature later.
Now let's work with Exterior2nd.
In [471]:
full.Exterior2nd.hist()
plt.xlabel("Exterior2nd")
plt.ylabel("Count")
plt.xticks(rotation=90)
plt.show()
The above two distributions look the same.
We will delete one of them.
In [232]:
delete.append('Exterior2nd')
Now let's work with MasVnrType.
In [473]:
full.MasVnrType.hist()
plt.xlabel("MasVnrType")
plt.ylabel("Count")
plt.xticks(rotation=90)
plt.show()
We will dummy encode this feature later.
Now let's work with ExterQual.
In [474]:
full.ExterQual.hist()
plt.xlabel("ExterQual")
plt.ylabel("Count")
plt.xticks(rotation=90)
plt.show()
In [233]:
dict1={'Po':1,'TA':2,'Fa':3,'Gd':4,'Ex':5}
In [234]:
full.ExterQual=full.ExterQual.apply(lambda x: dict1[x])
Now let's work with ExterCond.
In [477]:
full.ExterCond.hist()
plt.xlabel("ExterCond")
plt.ylabel("Count")
plt.show()
In [235]:
full.ExterCond=full.ExterCond.apply(lambda x: dict1[x])
Now let's work with Foundation.
In [479]:
full.Foundation.hist()
plt.xlabel("Foundation")
plt.ylabel("Count")
plt.show()
We will dummy encode this feature later.
Now let's work with Heating.
In [480]:
full.Heating.hist()
plt.xlabel("Heating")
plt.ylabel("Count")
plt.xticks(rotation=90)
plt.show()
We will remove this feature later.
In [236]:
delete.append("Heating")
Now let's work with HeatingQC.
In [482]:
full.HeatingQC.hist()
plt.xlabel("HeatingQC")
plt.ylabel("Count")
plt.show()
In [237]:
full.HeatingQC=full.HeatingQC.apply(lambda x: dict1[x])
Now let's work with CentralAir.
In [484]:
full.CentralAir.hist()
plt.xlabel("CentralAir")
plt.ylabel("Count")
plt.show()
Let's convert 'Y' to "1" and 'N' to "0". 'Y' is greater than 'N' because with central air you get a higher price.
In [238]:
full.CentralAir=full.CentralAir.apply(lambda x: 1 if x=='Y' else 0)
Now let's work with Electrical.
In [486]:
full.Electrical.hist()
plt.xlabel("Electrical")
plt.ylabel("Count")
plt.show()
We will delete this feature later.
In [239]:
delete.append("Electrical")
Now let's work with KitchenQual.
In [488]:
full.KitchenQual.hist()
plt.xlabel("KitchenQual")
plt.ylabel("Count")
plt.show()
In [240]:
full.KitchenQual=full.KitchenQual.apply(lambda x: dict1[x])
Now let's work with Functional .
In [490]:
full.Functional.hist()
plt.xlabel("Functional")
plt.ylabel("Count")
plt.show()
We will delete this feature later.
In [241]:
delete.append("Functional")
Now let's work with GarageType.
In [492]:
full.GarageType.hist()
plt.xlabel("GarageType")
plt.ylabel("Count")
plt.show()
We will dummy encode this feature later.
Now let's work with PavedDrive.
In [493]:
full.PavedDrive.hist()
plt.xlabel("PavedDrive")
plt.ylabel("Count")
plt.show()
We will delete this feature later.
In [242]:
delete.append("PavedDrive")
Now let's work with SaleType.
In [243]:
full.SaleType.isna().sum()
Out[243]:
0
In [496]:
full.SaleType.hist()
plt.xlabel("SaleType")
plt.ylabel("Count")
plt.show()
We will dummy encode this feature later.
Now let's work with SaleCondition.
In [497]:
full.SaleCondition.hist()
plt.xlabel("Sale Condition")
plt.ylabel("Count")
plt.show()
We will dummy encode this feature later.
We are now done with categorical and numerical features.
Let's try adding some new features.
We have some feature that are areas of finished and unfininshed basement and we have a feature that is the total basement array. Let's check if there is a relation.
In [244]:
(full.BsmtFinSF1+full.BsmtFinSF2+full.BsmtUnfSF).equals(full.TotalBsmtSF)
Out[244]:
True
We can remove the features BsmtFinSF1, BsmtFinSF2, and BsmtUnfSF.
In [245]:
delete.append("BsmtFinSF1")
delete.append("BsmtFinSF2")
delete.append("BsmtUnfSF")
Let's also check if the sum of 1stFlrSF, 2ndFlrSF, LowQualFinSF is equal to GrLivArea.
In [246]:
(full['1stFlrSF']+full['2ndFlrSF']+full.LowQualFinSF).equals(full.GrLivArea)
Out[246]:
True
We can remove the features 1stFlrSF, 2ndFlrSF, and LowQualFinSF.
In [247]:
delete.append("1stFlrSF")
delete.append("2ndFlrSF")
delete.append("LowQualFinSF")
Let's look at the "Bathroom" features now.
In [248]:
print(full.FullBath.corr(y))
print(full.HalfBath.corr(y))
print(full.BsmtFullBath.corr(y))
print(full.BsmtHalfBath.corr(y))
0.5718673954241733
0.2997792902100934
0.2212091220474469
-0.017281473694899883
Let's see if we add the bathroom features together into one feature that may have a better correlation.
Here each half bathroom is multiplied by 0.5.
Credits of this idea goes to Erik Bruin.
In [249]:
full['Total_Bathrooms']=(full.FullBath+full.HalfBath*0.5+full.BsmtFullBath+full.BsmtHalfBath*0.5)
In [250]:
full.Total_Bathrooms.corr(y)
Out[250]:
0.6391271749177662
We can remove the other Bath features.
In [251]:
delete.append("FullBath")
delete.append("HalfBath")
delete.append("BsmtFullBath")
delete.append("BsmtHalfBath")
Let's look at the "Porch" features.
In [252]:
print(full.OpenPorchSF.corr(y))
print(full.EnclosedPorch.corr(y))
print(full['3SsnPorch'].corr(y))
print(full.ScreenPorch.corr(y))
0.31554792275247034
-0.11978981907208046
0.02082628480262346
0.11989055290139497
Let's see if adding them improves the correlation with the target feature.
In [253]:
(full.OpenPorchSF+full.EnclosedPorch+full['3SsnPorch']+full.ScreenPorch).corr(y)
Out[253]:
0.19782953308159243
It doesn't but I will add them as the correlation in either way is low.
In [254]:
full['TotalPorchSF']=full.OpenPorchSF+full.EnclosedPorch+full['3SsnPorch']+full.ScreenPorch
In [255]:
delete.append("OpenPorchSF")
delete.append("EnclosedPorch")
delete.append("3SsnPorch")
delete.append("ScreenPorch")
In [256]:
full.drop(delete,axis=1,inplace=True)
I will add a feature which describes the age of the house since the last time it was remodeled.
In [257]:
full['Age']=full.YrSold.astype('int')-full.YearRemodAdd.astype('int')
Let's now check the correlation matrix and try to remove highly correlated features.
In [259]:
plt.subplots(figsize=(15,10))
ax=sns.heatmap(full.corr(),cmap=sns.color_palette("RdBu_r", 5),linewidths=.7)
cbar = ax.collections[0].colorbar
cbar.set_ticks(np.arange(-1,1.1,0.2))
plt.title("Heatmap of Numerical Features")
plt.show()
Highly correlated features are chosen whose correlation is shown in red above. We need to delete one of the highly correlated features. To choose one of them, we will compare the correlation of each with the target value and keep the one with the highest correlation.
In [260]:
#compare which of the highly correlated features has a higher correlation with the price
full.PoolArea.corr(y)>full.PoolQC.corr(y)
Out[260]:
False
In [261]:
#compare which of the highly correlated features has a higher correlation
full.GarageQual.corr(y)>full.GarageCond.corr(y)
Out[261]:
True
In [262]:
#compare which of the highly correlated features has a higher correlation
full.GarageArea.corr(y)>full.GarageCars.corr(y)
Out[262]:
False
In [263]:
#compare which of the highly correlated features has a higher correlation
full.ExterQual.corr(y)>full.KitchenQual.corr(y)
Out[263]:
True
In [264]:
#drop highly correlated features
full.drop(['PoolArea','GarageCond','GarageArea',"TotRmsAbvGrd",'KitchenQual'],axis=1,inplace=True)

Training

In [265]:
numerical_features,categorical_features=get_features(full)
We are now going to dummy encode the data.
We have an option to drop a column of the dummy variables. This is to ensure we don't fall in the dummy variable trap.
In [266]:
full_wd=full.copy() #wd=with dummy
full_wod=full.copy() #wod= without dummy

full_wd=pd.get_dummies(full_wd)
full_wod=pd.get_dummies(full_wod,drop_first=True)

train_wd=full_wd.iloc[:index,:]
test_wd=full_wd.iloc[index:,:]

train_wod=full_wod.iloc[:index,:]
test_wod=full_wod.iloc[index:,:]
We are going to try several models.
We will install the nedded one here.
In [267]:
!pip install lightgbm
Collecting lightgbm
  Downloading https://files.pythonhosted.org/packages/bf/01/45e209af10fd16537df0c5d8a5474c286554c3eaf9ddb0ce04113f1e8506/lightgbm-2.1.1-py2.py3-none-manylinux1_x86_64.whl (711kB)
    100% |████████████████████████████████| 716kB 1.2MB/s ta 0:00:01
Requirement already satisfied: scipy in /opt/conda/lib/python3.6/site-packages (from lightgbm)
Requirement already satisfied: numpy in /opt/conda/lib/python3.6/site-packages (from lightgbm)
Requirement already satisfied: scikit-learn in /opt/conda/lib/python3.6/site-packages (from lightgbm)
Installing collected packages: lightgbm
Successfully installed lightgbm-2.1.1
You are using pip version 9.0.1, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
We define the models and assess them here.
Here is our function for calculating the mean rsme and standard deviation of the rsme.
In [268]:
def model_score(name,model,mtrain,y,cv=50):
    
    #mse are by default negative in sklearn
    mse=-cross_val_score(model,mtrain,y,cv=cv,scoring='neg_mean_squared_error')
    rmse=np.sqrt(mse)
    average_rmse=np.mean(rmse)
    std=np.std(rmse)
    #print(rmse)
    print("%s: The average RMSE is: %f with a standard deviation of %f." %(name,average_rmse,std))
    return average_rmse,std
Out of the models that we will use, we have some which are meta-estimators which might give us better results. For more info,click here.
In [269]:
#define models
from sklearn.linear_model import LinearRegression ,ElasticNet, Lasso, LassoLars,HuberRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor,AdaBoostRegressor,BaggingRegressor,ExtraTreesRegressor
from sklearn.svm import  SVR
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
In [270]:
models=[LinearRegression(),ElasticNet(),Lasso(),LassoLars(),HuberRegressor(),RandomForestRegressor(),\
        GradientBoostingRegressor(),AdaBoostRegressor(),BaggingRegressor(), ExtraTreesRegressor(),SVR(),\
        LGBMRegressor(),XGBRegressor()]


#LR:Linear Regressor
#EN:Elastic Net
#L:Lasso
#Lar:LassoLars
#HR:HuberRegressor
#RF:Random Forest
#GBR:Gradient Boosting Regressor
#Ada:AdaBoostRegressor
#Bag:BaggingRegressor
#Extra:ExtraTreesRegressor
#SVR:Support Vector Regressor
#LGBMR:Light GBM Regressor
#XGBR:XGB Regressor

names=["LR","EN","L","Lar","HR","RF","GBR","Ada","Bag","Extra", "SVR","LGBMR","XGBR"]
Although we don't have to apply k-fold cross validation for ensemble methods such as Random Forests and others since by definition random forests are several decision trees, we will do k-fold cross validation on these models to study the standard deviation of these models.
In [190]:
#lists to save rsme and std of each model
errors=[]
stds=[]
cv_=50
#iterate through models,calculate rsme and std,store them in lists
for name,model in zip(names,models):
    rsme,std=model_score(name,model,train_wod,log_y,cv_)
    errors.append(rsme)
    stds.append(std)
#bar plot of model rsme
bar1=plt.bar(names,errors,label="RSME")
bar2=plt.bar(names,stds,label="Variance")
#we change the color of the bar with min rsme and min std to lighter colors
bar1[np.argmin(errors)].set_color('#bae8f5')
bar2[np.argmin(stds)].set_color('#ffcc66')
plt.legend()
LR: The average RMSE is: 0.128947 with a standard deviation of 0.042698.
EN: The average RMSE is: 0.172177 with a standard deviation of 0.051970.
L: The average RMSE is: 0.179298 with a standard deviation of 0.052386.
Lar: The average RMSE is: 0.398519 with a standard deviation of 0.064352.
HR: The average RMSE is: 0.170894 with a standard deviation of 0.052063.
RF: The average RMSE is: 0.148484 with a standard deviation of 0.036599.
GBR: The average RMSE is: 0.127117 with a standard deviation of 0.039404.
Ada: The average RMSE is: 0.174108 with a standard deviation of 0.033708.
Bag: The average RMSE is: 0.148079 with a standard deviation of 0.041187.
Extra: The average RMSE is: 0.146243 with a standard deviation of 0.035813.
SVR: The average RMSE is: 0.396843 with a standard deviation of 0.065563.
LGBMR: The average RMSE is: 0.128632 with a standard deviation of 0.035147.
XGBR: The average RMSE is: 0.126893 with a standard deviation of 0.038967.
Out[190]:
<matplotlib.legend.Legend at 0x7f8d04a53198>
In [192]:
#lists to save rsme and std of each model
errors=[]
stds=[]

#iterate through models,calculate rsme and std,store them in lists
for name,model in zip(names,models):
    rsme,std=model_score(name,model,train_wd,log_y)
    errors.append(rsme)
    stds.append(std)
#bar plot of model rsme
bar1=plt.bar(names,errors,label="RSME")
bar2=plt.bar(names,stds,label="Variance")
#we change the color of the bar with min rsme and min std to lighter colors
bar1[np.argmin(errors)].set_color('#bae8f5')
bar2[np.argmin(stds)].set_color('#ffcc66')
plt.legend()
LR: The average RMSE is: 0.128944 with a standard deviation of 0.042698.
EN: The average RMSE is: 0.172177 with a standard deviation of 0.051970.
L: The average RMSE is: 0.179298 with a standard deviation of 0.052386.
Lar: The average RMSE is: 0.398519 with a standard deviation of 0.064352.
HR: The average RMSE is: 0.172001 with a standard deviation of 0.052760.
RF: The average RMSE is: 0.145920 with a standard deviation of 0.039825.
GBR: The average RMSE is: 0.127319 with a standard deviation of 0.039875.
Ada: The average RMSE is: 0.174115 with a standard deviation of 0.032819.
Bag: The average RMSE is: 0.144649 with a standard deviation of 0.037005.
Extra: The average RMSE is: 0.143764 with a standard deviation of 0.034898.
SVR: The average RMSE is: 0.396795 with a standard deviation of 0.065595.
LGBMR: The average RMSE is: 0.128511 with a standard deviation of 0.034714.
XGBR: The average RMSE is: 0.126725 with a standard deviation of 0.036599.
Out[192]:
<matplotlib.legend.Legend at 0x7f8d048a8978>
  • Results don't seem better. Feature engineering done was not the best.
  • Models vary a lot in their results.
  • In the next approach (and hopefully the final), we will try different version of feature engineering,try to optimize the models by parameter tuning and choose the best one.

Approach #4

In [271]:
#import cleaned data
full=full_cleaned.copy()
Having our cleaned data, we need to select the appropriate features. It turns out that there are some common ways to do so:
  • RFE: Recursive Forward Elimination
  • Fitting a random forest and checking the importance of each feature
  • Fitting a regularized linear model and checking the importance of each feature
  • Using VIF to eliminate unimportant features.
  • Trying to get a sense of the data and choosing the features manually.

In this first phase, we will just fit a random forest and a regularized linear model and check the importances of the feature. RFE is a good way to go but we have many features and models to choose from,so we will first try to find the best models and then use RFE to see if we get better results.
Before doing so, we have some features which are considered as numerical but in fact should be categorical.
Let's fix that.
In [272]:
numerical_features,categorical_features=get_features(full)
In [273]:
numerical_features
Out[273]:
Index(['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces',
       'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
       'MoSold', 'YrSold'],
      dtype='object')
In [274]:
#MSSubClass should be categorical
full['MSSubClass'] = full['MSSubClass'].apply(str)

#Year and month sold should also be categorical.
full['YrSold'] = full['YrSold'].astype(str)
full['MoSold'] = full['MoSold'].astype(str)
In [275]:
numerical_features,categorical_features=get_features(full)
In [276]:
numerical_features
Out[276]:
Index(['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
       'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
       'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
       'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars',
       'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
       'ScreenPorch', 'PoolArea', 'MiscVal'],
      dtype='object')
In [277]:
categorical_features
Out[277]:
Index(['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour',
       'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1',
       'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl',
       'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond',
       'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
       'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical',
       'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType',
       'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC',
       'Fence', 'MiscFeature', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition'],
      dtype='object')
Let's us encode some categorical features into numbers.
The code is copied form the previous approach and we added a few things.
In [278]:
dict1={'No Pool':0,'Fa':1,'Gd':2,'Ex':3}
full.PoolQC=full.PoolQC.apply(lambda x: dict1[x])

dict1={'NA':0,'Po':1,'TA':2,'Fa':3,'Gd':4,'Ex':5}
full.FireplaceQu=full.FireplaceQu.apply(lambda x: dict1[x])

dict1={'No Garage':0,'Po':1,'TA':2,'Fa':3,'Gd':4,'Ex':5}
full.GarageCond=full.GarageCond.apply(lambda x: dict1[x])
full.GarageQual=full.GarageQual.apply(lambda x: dict1[x])

dict1={'No Garage':0,'Unf':1,'RFn':2,'Fin':3}
full.GarageFinish=full.GarageFinish.apply(lambda x: dict1[x])

dict1={'No Basement':0, 'Unf':1, 'LwQ':2, 'Rec':3, 'BLQ':4, 'ALQ':5, 'GLQ':6}
full.BsmtFinType1=full.BsmtFinType1.apply(lambda x: dict1[x])
full.BsmtFinType2=full.BsmtFinType2.apply(lambda x: dict1[x])

dict1={'No Basement':0, 'No':1, 'Mn':2, 'Av':3, 'Gd':4}
full.BsmtExposure=full.BsmtExposure.apply(lambda x: dict1[x])

dict1={'No Basement':0,'Po':1,'TA':2,'Fa':3,'Gd':4,'Ex':5}
full.BsmtQual=full.BsmtQual.apply(lambda x: dict1[x])
full.BsmtCond=full.BsmtCond.apply(lambda x: dict1[x])

dict1={'Po':1,'TA':2,'Fa':3,'Gd':4,'Ex':5}
full.ExterQual=full.ExterQual.apply(lambda x: dict1[x])
full.ExterCond=full.ExterCond.apply(lambda x: dict1[x])
full.HeatingQC=full.HeatingQC.apply(lambda x: dict1[x])
full.KitchenQual=full.KitchenQual.apply(lambda x: dict1[x])


full.CentralAir=full.CentralAir.apply(lambda x: 1 if x=='Y' else 0)


#added these

#When home functionality is typical,price is higher and it decreases as functionality decreases.
dict1={'Sal':0, 'Sev':1, 'Maj2':2, 'Maj1':3, 'Mod':4, 'Min2':5, 'Min1':6, 'Typ':7}
full.Functional=full.Functional.apply(lambda x: dict1[x])

#Lands with gentle slope usually cost more than lands with moderate slope which usually cost more than lands with severe slopes.
dict1={'Sev':0,'Mod':1,'Gtl':2}
full.LandSlope=full.LandSlope.apply(lambda x: dict1[x])



#A paved street costs more, so let's encode 'Pave' with 1 and 'Grvl' with 0.
full.Street=full.Street.apply(lambda x: 1 if x=='Pave' else 0)

#A paved drive costs more.
dict1={'N':0,'P':1,'Y':2}
full.PavedDrive=full.PavedDrive.apply(lambda x: dict1[x])
Let us also add the features that we added in the last approach.
We will not drop features here,instead we will keep them and will see its importance as a feature for the model later.
We will also add some new features.(credits)
In [279]:
full['Total_Bathrooms']=(full.FullBath+full.HalfBath*0.5+full.BsmtFullBath+full.BsmtHalfBath*0.5)

full['TotalPorchSF']=full.OpenPorchSF+full.EnclosedPorch+full['3SsnPorch']+full.ScreenPorch

#2010 because this dataset was available in 2010
full['Age']=2010-full.YearBuilt

#added these 
full["Remodeled"] = (full["YearRemodAdd"] != full["YearBuilt"]) * 1

area_cols = ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
             'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 
             'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'LowQualFinSF', 'PoolArea' ]

full["TotalArea"] =full[area_cols].sum(axis=1)
In [280]:
numerical_features,categorical_features=get_features(full)
But first before fitting the data, we need to encode the categorical variables.
In [281]:
full_hot=pd.get_dummies(full.copy(),drop_first=True)
                                              
Now let us see the importance of the features. We will compare feature importances obtained from both lasso model and random forest.
In [282]:
#lasso feature importances
lasso=Lasso()
In [283]:
lasso.fit(full_hot.copy().iloc[:index,:],log_y)
FI_lasso = pd.DataFrame({"Feature Importance":lasso.coef_}, index=full_hot.columns)
FI_lasso[FI_lasso["Feature Importance"]!=0].sort_values("Feature Importance").plot(kind="barh",figsize=(15,25))
plt.show()
In [284]:
#try with random forest
rf=RandomForestRegressor()
In [285]:
rf.fit(full_hot.copy().iloc[:index,:],log_y)
FI_rf = pd.DataFrame({"Feature Importance":rf.feature_importances_}, index=full_hot.columns)
FI_rf.sort_values("Feature Importance").iloc[-20:,:].plot(kind="barh",figsize=(15,25))
plt.show()
The results obtained by Lasso seem weird.
A lot of features that should be there such as OverallQual are not.
On the other hand,the features obtained with the random forest are much more reasonable.
Thus we will take the features obtained with random forest.
The problem here is to choose a certain threshold by which a feature is important.
Inspecting the feature importances values, we choose a threshold of 0.0001. Although the threshold might seem small,but there are much less values and choosing higher thresholds will remove a lot of features.
In [287]:
threshold=0.0001
In [288]:
imp_features=FI_rf[FI_rf['Feature Importance']>threshold].index
In [289]:
full_imp=full_hot.copy()[imp_features]
We now have the most important features.
Let's also extract features using VIF.
The variance inflation factor is a measure for the increase of the variance of the parameter estimates if an additional variable, given by exog_idx is added to the linear regression. It is a measure for multicollinearity of the design matrix, exog.
One recommendation is that if VIF is greater than 5, then the explanatory variable given by exog_idx is highly collinear with the other explanatory variables, and the parameter estimates will have large standard errors because of this.( source)
In [290]:
!pip install statsmodels
Requirement already satisfied: statsmodels in /opt/conda/lib/python3.6/site-packages
You are using pip version 9.0.1, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
In [291]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
In [292]:
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(full_hot.values, i) for i in range(full_hot.shape[1])]
vif["features"] = full_hot.columns
In [293]:
#choose important features
imp_features=vif[vif["VIF Factor"]>5].features.values

full_vif=full_hot.copy()[imp_features]
We have now two different data sets.
We are going to generate more data sets by scaling the data and applying a Box Cox Transformation for the skewed features.
We could have just used the function "log" but applying the box cox transformation gave slightly better results. Credits
For more info about the box cox transformation, click here.
First let's transform the skewed data.
We will use a robust scaler which better deals with outliers. From these 2 datasets, we will produce another 2 where we apply the box cox transformation for skewed variables.
In [294]:
from scipy.special import boxcox1p

#for full_imp
skew=pd.DataFrame()
skew['Skew']=full_imp.skew()
skewed_columns=skew[skew>0.7].dropna().index
full_imp_skewed=full_imp.copy()
full_imp_skewed[skewed_columns]=boxcox1p(full_imp_skewed[skewed_columns],0.15)


#for full_vif
skew=pd.DataFrame()
skew['Skew']=full_vif.skew()
skewed_columns=skew[skew>0.7].dropna().index
full_vif_skewed=full_vif.copy()
full_vif_skewed[skewed_columns]=boxcox1p(full_vif_skewed[skewed_columns],0.15)
Now let's transform the skewed data.
We will use a robust scaler which better deals with outliers. From the 4 datasets we have, we will produce another 4 where are a scaled version.
In [295]:
from sklearn.preprocessing import RobustScaler
In [296]:
scaler=RobustScaler()
In [297]:
full_imp_scaled=full_imp.copy()
full_imp_scaled[full_imp.columns]=scaler.fit_transform(full_imp_scaled)
In [298]:
full_imp_skewed_scaled=full_imp_skewed.copy()
full_imp_skewed_scaled[full_imp_skewed.columns]=scaler.fit_transform(full_imp_skewed_scaled)
In [299]:
full_vif_scaled=full_vif.copy()
full_vif_scaled[full_vif_scaled.columns]=scaler.fit_transform(full_vif_scaled)
In [300]:
full_vif_skewed_scaled=full_vif_skewed.copy()
full_vif_skewed_scaled[full_vif_skewed_scaled.columns]=scaler.fit_transform(full_vif_skewed_scaled)
We now have 8 different data sets.
Let's join them in one list.
In [301]:
datasets=[full_imp,full_imp_skewed,full_imp_scaled,full_imp_skewed_scaled,full_vif,full_vif_skewed,full_vif_scaled,full_vif_skewed_scaled]
In [302]:
datasets_names=["full_imp","full_imp_skewed","full_imp_scaled","full_imp_skewed_scaled","full_vif","full_vif_skewed","full_vif_scaled","full_vif_skewed_scaled"]
We can now train our models.
In [118]:
from sklearn.cross_validation import cross_val_score
def model_score(name,model,mtrain,y,cv=50):
    
    #mse are by default negative in sklearn
    mse=-cross_val_score(model,mtrain,y,cv=cv,scoring='neg_mean_squared_error')
    rmse=np.sqrt(mse)
    average_rmse=np.mean(rmse)
    std=np.std(rmse)
    #print(rmse)
    print("%s: The average RMSE is: %f with a standard deviation of %f." %(name,average_rmse,std))
    return average_rmse,std
In [306]:
#define models
from sklearn.linear_model import LinearRegression ,ElasticNet, Lasso, LassoLars,HuberRegressor,Ridge,BayesianRidge
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor,AdaBoostRegressor,BaggingRegressor,ExtraTreesRegressor
from sklearn.svm import  SVR
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
In [120]:
models=[LinearRegression(),ElasticNet(),Lasso(),LassoLars(),HuberRegressor(),RandomForestRegressor(),\
        GradientBoostingRegressor(n_estimators=500,loss='huber'),AdaBoostRegressor(),BaggingRegressor(), ExtraTreesRegressor(),SVR(),\
        LGBMRegressor(objective='regression',n_estimators=500),XGBRegressor(n_estimators=500),KernelRidge(kernel='polynomial'),Ridge(),BayesianRidge()]


#LR:Linear Regressor
#EN:Elastic Net
#L:Lasso
#Lar:LassoLars
#HR:HuberRegressor
#RF:Random Forest
#GBR:Gradient Boosting Regressor
#Ada:AdaBoostRegressor
#Bag:BaggingRegressor
#Extra:ExtraTreesRegressor
#SVR:Support Vector Regressor
#LGBMR:Light GBM Regressor
#XGBR:XGB Regressor
#KR: Kernel Ridge
#R:Ridge
#BR:Bayesian Ridge
names=["LR","EN","L","Lar","HR","RF","GBR","Ada","Bag","Extra", "SVR","LGBMR","XGBR","KR","R","BR"]
In [264]:
for i in range(len(datasets)):
    data=datasets[i]
    
    #get train dataset
    train=data.iloc[:index,:]

    #lists to save rsme and std of each model
    errors=[]
    stds=[]
    cv_=50
    #iterate through models,calculate rsme and std,store them in lists
    for name,model in zip(names,models):
        rsme,std=model_score(name,model,train,log_y,cv_)
        errors.append(rsme)
        stds.append(std)
    #bar plot of model rsme
    plt.figure()
    plt.title(datasets_names[i])
    plt.ylim([0,0.5])
    bar1=plt.bar(names,errors,label="RSME")
    bar2=plt.bar(names,stds,label="Variance")
    #we change the color of the bar with min rsme and min std to lighter colors
    bar1[np.argmin(errors)].set_color('#bae8f5')
    bar2[np.argmin(stds)].set_color('#ffcc66')
    plt.legend()
    plt.show()
LR: The average RMSE is: 0.124118 with a standard deviation of 0.045235.
EN: The average RMSE is: 0.163913 with a standard deviation of 0.051284.
L: The average RMSE is: 0.171157 with a standard deviation of 0.052029.
Lar: The average RMSE is: 0.398519 with a standard deviation of 0.064352.
HR: The average RMSE is: 0.162140 with a standard deviation of 0.051853.
RF: The average RMSE is: 0.143477 with a standard deviation of 0.036276.
GBR: The average RMSE is: 0.123962 with a standard deviation of 0.041423.
Ada: The average RMSE is: 0.170529 with a standard deviation of 0.034180.
Bag: The average RMSE is: 0.143416 with a standard deviation of 0.040869.
Extra: The average RMSE is: 0.142532 with a standard deviation of 0.035357.
SVR: The average RMSE is: 0.397461 with a standard deviation of 0.065301.
LGBMR: The average RMSE is: 0.124818 with a standard deviation of 0.033953.
XGBR: The average RMSE is: 0.124171 with a standard deviation of 0.038004.
KR: The average RMSE is: 5.031851 with a standard deviation of 22.653322.
R: The average RMSE is: 0.122880 with a standard deviation of 0.045303.
BR: The average RMSE is: 0.120183 with a standard deviation of 0.045882.
LR: The average RMSE is: 0.122452 with a standard deviation of 0.036513.
EN: The average RMSE is: 0.259958 with a standard deviation of 0.049730.
L: The average RMSE is: 0.263615 with a standard deviation of 0.051129.
Lar: The average RMSE is: 0.398519 with a standard deviation of 0.064352.
HR: The average RMSE is: 0.282480 with a standard deviation of 0.065255.
RF: The average RMSE is: 0.148326 with a standard deviation of 0.041908.
GBR: The average RMSE is: 0.124987 with a standard deviation of 0.041626.
Ada: The average RMSE is: 0.170051 with a standard deviation of 0.032370.
Bag: The average RMSE is: 0.145537 with a standard deviation of 0.037147.
Extra: The average RMSE is: 0.142720 with a standard deviation of 0.037727.
SVR: The average RMSE is: 0.319470 with a standard deviation of 0.065922.
LGBMR: The average RMSE is: 0.124807 with a standard deviation of 0.033930.
XGBR: The average RMSE is: 0.124069 with a standard deviation of 0.038015.
KR: The average RMSE is: 0.296542 with a standard deviation of 0.114131.
R: The average RMSE is: 0.120369 with a standard deviation of 0.036691.
BR: The average RMSE is: 0.118133 with a standard deviation of 0.036353.
LR: The average RMSE is: 0.124220 with a standard deviation of 0.045304.
EN: The average RMSE is: 0.392857 with a standard deviation of 0.064027.
L: The average RMSE is: 0.393253 with a standard deviation of 0.063663.
Lar: The average RMSE is: 0.398519 with a standard deviation of 0.064352.
HR: The average RMSE is: 1.458720 with a standard deviation of 0.278596.
RF: The average RMSE is: 0.147080 with a standard deviation of 0.038503.
GBR: The average RMSE is: 0.124353 with a standard deviation of 0.042702.
Ada: The average RMSE is: 0.170995 with a standard deviation of 0.033179.
Bag: The average RMSE is: 0.147939 with a standard deviation of 0.037449.
Extra: The average RMSE is: 0.144019 with a standard deviation of 0.036886.
SVR: The average RMSE is: 0.195149 with a standard deviation of 0.063452.
LGBMR: The average RMSE is: 0.125113 with a standard deviation of 0.033151.
XGBR: The average RMSE is: 0.124279 with a standard deviation of 0.038021.
KR: The average RMSE is: 1.105860 with a standard deviation of 1.339801.
R: The average RMSE is: 0.122721 with a standard deviation of 0.045285.
BR: The average RMSE is: 0.119941 with a standard deviation of 0.045148.
LR: The average RMSE is: 0.125612 with a standard deviation of 0.039048.
EN: The average RMSE is: 0.398519 with a standard deviation of 0.064352.
L: The average RMSE is: 0.398519 with a standard deviation of 0.064352.
Lar: The average RMSE is: 0.398519 with a standard deviation of 0.064352.
HR: The average RMSE is: 0.145000 with a standard deviation of 0.042663.
RF: The average RMSE is: 0.146933 with a standard deviation of 0.038445.
GBR: The average RMSE is: 0.123915 with a standard deviation of 0.043270.
Ada: The average RMSE is: 0.173475 with a standard deviation of 0.034072.
Bag: The average RMSE is: 0.144262 with a standard deviation of 0.039097.
Extra: The average RMSE is: 0.141477 with a standard deviation of 0.038251.
SVR: The average RMSE is: 0.128022 with a standard deviation of 0.041803.
LGBMR: The average RMSE is: 0.125096 with a standard deviation of 0.033130.
XGBR: The average RMSE is: 0.124178 with a standard deviation of 0.038033.
KR: The average RMSE is: 0.133004 with a standard deviation of 0.043544.
R: The average RMSE is: 0.120311 with a standard deviation of 0.036709.
BR: The average RMSE is: 0.117944 with a standard deviation of 0.036482.
LR: The average RMSE is: 190.731013 with a standard deviation of 1080.517915.
EN: The average RMSE is: 0.174541 with a standard deviation of 0.053177.
L: The average RMSE is: 0.177836 with a standard deviation of 0.052648.
Lar: The average RMSE is: 0.398519 with a standard deviation of 0.064352.
HR: The average RMSE is: 0.175643 with a standard deviation of 0.053243.
RF: The average RMSE is: 0.151557 with a standard deviation of 0.038649.
GBR: The average RMSE is: 0.142244 with a standard deviation of 0.042493.
Ada: The average RMSE is: 0.178623 with a standard deviation of 0.035352.
Bag: The average RMSE is: 0.153763 with a standard deviation of 0.039021.
Extra: The average RMSE is: 0.150416 with a standard deviation of 0.039792.
SVR: The average RMSE is: 0.397433 with a standard deviation of 0.065311.
LGBMR: The average RMSE is: 0.145108 with a standard deviation of 0.033010.
XGBR: The average RMSE is: 0.139838 with a standard deviation of 0.040911.
KR: The average RMSE is: 2.857032 with a standard deviation of 10.161093.
R: The average RMSE is: 0.138667 with a standard deviation of 0.043520.
BR: The average RMSE is: 0.135393 with a standard deviation of 0.045562.
LR: The average RMSE is: 3603.501612 with a standard deviation of 22687.128280.
EN: The average RMSE is: 0.275552 with a standard deviation of 0.052568.
L: The average RMSE is: 0.277220 with a standard deviation of 0.052972.
Lar: The average RMSE is: 0.398519 with a standard deviation of 0.064352.
HR: The average RMSE is: 0.261023 with a standard deviation of 0.070999.
RF: The average RMSE is: 0.153671 with a standard deviation of 0.039764.
GBR: The average RMSE is: 0.143295 with a standard deviation of 0.044679.
Ada: The average RMSE is: 0.178823 with a standard deviation of 0.033318.
Bag: The average RMSE is: 0.151137 with a standard deviation of 0.039644.
Extra: The average RMSE is: 0.148088 with a standard deviation of 0.038925.
SVR: The average RMSE is: 0.295764 with a standard deviation of 0.060064.
LGBMR: The average RMSE is: 0.145044 with a standard deviation of 0.032920.
XGBR: The average RMSE is: 0.139805 with a standard deviation of 0.040917.
KR: The average RMSE is: 0.537151 with a standard deviation of 0.224184.
R: The average RMSE is: 0.134610 with a standard deviation of 0.035835.
BR: The average RMSE is: 0.132392 with a standard deviation of 0.036994.
LR: The average RMSE is: 0.145249 with a standard deviation of 0.046849.
EN: The average RMSE is: 0.398127 with a standard deviation of 0.077649.
L: The average RMSE is: 0.397766 with a standard deviation of 0.074542.
Lar: The average RMSE is: 0.398519 with a standard deviation of 0.064352.
HR: The average RMSE is: 2.150668 with a standard deviation of 0.381973.
RF: The average RMSE is: 0.154684 with a standard deviation of 0.039999.
GBR: The average RMSE is: 0.142801 with a standard deviation of 0.044055.
Ada: The average RMSE is: 0.178569 with a standard deviation of 0.033476.
Bag: The average RMSE is: 0.156491 with a standard deviation of 0.037155.
Extra: The average RMSE is: 0.149674 with a standard deviation of 0.040028.
SVR: The average RMSE is: 0.218709 with a standard deviation of 0.062930.
LGBMR: The average RMSE is: 0.146080 with a standard deviation of 0.032177.
XGBR: The average RMSE is: 0.139831 with a standard deviation of 0.040903.
KR: The average RMSE is: 103.532565 with a standard deviation of 698.953681.
R: The average RMSE is: 0.138670 with a standard deviation of 0.043473.
BR: The average RMSE is: 0.135543 with a standard deviation of 0.044571.
LR: The average RMSE is: 11475979.424061 with a standard deviation of 50230689.945059.
EN: The average RMSE is: 0.398519 with a standard deviation of 0.064352.
L: The average RMSE is: 0.398519 with a standard deviation of 0.064352.
Lar: The average RMSE is: 0.398519 with a standard deviation of 0.064352.
HR: The average RMSE is: 0.203946 with a standard deviation of 0.072745.
RF: The average RMSE is: 0.154108 with a standard deviation of 0.040019.
GBR: The average RMSE is: 0.142570 with a standard deviation of 0.044024.
Ada: The average RMSE is: 0.177743 with a standard deviation of 0.034859.
Bag: The average RMSE is: 0.154938 with a standard deviation of 0.039456.
Extra: The average RMSE is: 0.151775 with a standard deviation of 0.037277.
SVR: The average RMSE is: 0.147555 with a standard deviation of 0.044955.
LGBMR: The average RMSE is: 0.146068 with a standard deviation of 0.032149.
XGBR: The average RMSE is: 0.139805 with a standard deviation of 0.040917.
KR: The average RMSE is: 0.145806 with a standard deviation of 0.044280.
R: The average RMSE is: 0.134545 with a standard deviation of 0.035828.
BR: The average RMSE is: 0.132278 with a standard deviation of 0.036618.
It seems that features using the Random Forest feature importance have produced beter results than using VIF.
The Bayesian Ridge produced the lowest error in all datasets.

The lowest result was using the data set "full_imp_skewed_scaled",so we will from now on train the models in this data set.
Let's try doing some grid search to try and optimize the models.
We will choose the models with the lowest errors in this data set which are Bayesian Ridge,Ridge,Gradient Boosting Regressor, and XGBoost.
In [303]:
from sklearn.model_selection import GridSearchCV
We start with the Bayesian Ridge.
In [304]:
train=full_imp_skewed_scaled.iloc[:index,:]
In [72]:
parameters=[{'n_iter':[300,1000],'tol':[0.1,0.001],'alpha_1':[1e-6,1e-4],\
             'lambda_1':[1e-6,1e-4]}]

BR=BayesianRidge()
grid_search= GridSearchCV(estimator=BR,param_grid=parameters,scoring='neg_mean_squared_error',cv=50,n_jobs=-1,verbose=1)
grid_search=grid_search.fit(train,log_y)
print("The best score is %f." %(grid_search.best_score_))
print("The best parameters are ",grid_search.best_params_ )
grid_search.grid_scores_
Fitting 50 folds for each of 16 candidates, totalling 800 fits
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:   29.4s
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed:  3.1min
The best score is -0.015141.
The best parameters are  {'alpha_1': 1e-06, 'n_iter': 300, 'lambda_1': 0.0001, 'tol': 0.001}
[Parallel(n_jobs=-1)]: Done 800 out of 800 | elapsed:  3.3min finished
/usr/local/lib/python3.5/dist-packages/sklearn/model_selection/_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)
Out[72]:
[mean: -0.01515, std: 0.01178, params: {'alpha_1': 1e-06, 'n_iter': 300, 'lambda_1': 1e-06, 'tol': 0.1},
 mean: -0.01514, std: 0.01178, params: {'alpha_1': 1e-06, 'n_iter': 300, 'lambda_1': 1e-06, 'tol': 0.001},
 mean: -0.01515, std: 0.01178, params: {'alpha_1': 1e-06, 'n_iter': 1000, 'lambda_1': 1e-06, 'tol': 0.1},
 mean: -0.01514, std: 0.01178, params: {'alpha_1': 1e-06, 'n_iter': 1000, 'lambda_1': 1e-06, 'tol': 0.001},
 mean: -0.01515, std: 0.01178, params: {'alpha_1': 1e-06, 'n_iter': 300, 'lambda_1': 0.0001, 'tol': 0.1},
 mean: -0.01514, std: 0.01178, params: {'alpha_1': 1e-06, 'n_iter': 300, 'lambda_1': 0.0001, 'tol': 0.001},
 mean: -0.01515, std: 0.01178, params: {'alpha_1': 1e-06, 'n_iter': 1000, 'lambda_1': 0.0001, 'tol': 0.1},
 mean: -0.01514, std: 0.01178, params: {'alpha_1': 1e-06, 'n_iter': 1000, 'lambda_1': 0.0001, 'tol': 0.001},
 mean: -0.01515, std: 0.01178, params: {'alpha_1': 0.0001, 'n_iter': 300, 'lambda_1': 1e-06, 'tol': 0.1},
 mean: -0.01514, std: 0.01178, params: {'alpha_1': 0.0001, 'n_iter': 300, 'lambda_1': 1e-06, 'tol': 0.001},
 mean: -0.01515, std: 0.01178, params: {'alpha_1': 0.0001, 'n_iter': 1000, 'lambda_1': 1e-06, 'tol': 0.1},
 mean: -0.01514, std: 0.01178, params: {'alpha_1': 0.0001, 'n_iter': 1000, 'lambda_1': 1e-06, 'tol': 0.001},
 mean: -0.01515, std: 0.01178, params: {'alpha_1': 0.0001, 'n_iter': 300, 'lambda_1': 0.0001, 'tol': 0.1},
 mean: -0.01514, std: 0.01178, params: {'alpha_1': 0.0001, 'n_iter': 300, 'lambda_1': 0.0001, 'tol': 0.001},
 mean: -0.01515, std: 0.01178, params: {'alpha_1': 0.0001, 'n_iter': 1000, 'lambda_1': 0.0001, 'tol': 0.1},
 mean: -0.01514, std: 0.01178, params: {'alpha_1': 0.0001, 'n_iter': 1000, 'lambda_1': 0.0001, 'tol': 0.001}]
In [308]:
b=BayesianRidge(alpha_1=1e-6,n_iter=300,lambda_1=0.0001,tol=0.001)
mse=-cross_val_score(b,train,log_y,cv=50,scoring='neg_mean_squared_error')
rmse=np.sqrt(mse)
average_rmse=np.mean(rmse)
std=np.std(rmse)
#print(rmse)
print(" The average RMSE is: %f with a standard deviation of %f." %(average_rmse,std))
 The average RMSE is: 0.118915 with a standard deviation of 0.035495.
Let's look at Ridge now.
In [91]:
parameters=[{'alpha':[1.e-4,0.1,1,5,10],'fit_intercept':[True,False],'tol':[0.001,0.1,1]}]

R=Ridge()
grid_search= GridSearchCV(estimator=R,param_grid=parameters,scoring='neg_mean_squared_error',cv=50,n_jobs=-1,verbose=1)
grid_search=grid_search.fit(train,log_y)
print("The best score is %f." %(grid_search.best_score_))
print("The best parameters are ",grid_search.best_params_ )
Fitting 50 folds for each of 30 candidates, totalling 1500 fits
[Parallel(n_jobs=-1)]: Done 199 tasks      | elapsed:    9.4s
[Parallel(n_jobs=-1)]: Done 449 tasks      | elapsed:   25.7s
[Parallel(n_jobs=-1)]: Done 799 tasks      | elapsed:   48.0s
[Parallel(n_jobs=-1)]: Done 1249 tasks      | elapsed:  1.3min
The best score is -0.015119.
The best parameters are  {'alpha': 10, 'fit_intercept': True, 'tol': 0.001}
[Parallel(n_jobs=-1)]: Done 1500 out of 1500 | elapsed:  1.6min finished
In [309]:
R=Ridge(alpha=10,fit_intercept=True,tol=0.001)
mse=-cross_val_score(R,train,log_y,cv=50,scoring='neg_mean_squared_error')
rmse=np.sqrt(mse)
average_rmse=np.mean(rmse)
std=np.std(rmse)
#print(rmse)
print(" The average RMSE is: %f with a standard deviation of %f." %(average_rmse,std))
 The average RMSE is: 0.118841 with a standard deviation of 0.035453.
Let's look at Gradient Boosting Regressor.
In [80]:
parameters=[{'loss':['ls','huber'],'learning_rate':[0.1,0.01,1],'n_estimators':[100,300,500],'max_depth':[3,7,10,15]}]

GBR=GradientBoostingRegressor()
grid_search= GridSearchCV(estimator=GBR,param_grid=parameters,scoring='neg_mean_squared_error',cv=50,n_jobs=-1,verbose=1)
grid_search=grid_search.fit(train,log_y)
print("The best score is %f." %(grid_search.best_score_))
print("The best parameters are ",grid_search.best_params_ )
Fitting 50 folds for each of 72 candidates, totalling 3600 fits
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:   31.4s
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed:  6.7min
[Parallel(n_jobs=-1)]: Done 1186 tasks      | elapsed: 15.2min
[Parallel(n_jobs=-1)]: Done 1736 tasks      | elapsed: 31.4min
[Parallel(n_jobs=-1)]: Done 2386 tasks      | elapsed: 69.4min
[Parallel(n_jobs=-1)]: Done 3136 tasks      | elapsed: 72.2min
[Parallel(n_jobs=-1)]: Done 3600 out of 3600 | elapsed: 73.6min finished
The best score is -0.016856.
The best parameters are  {'n_estimators': 500, 'learning_rate': 0.1, 'max_depth': 3, 'loss': 'ls'}
In [310]:
GBR=GradientBoostingRegressor(loss='ls',n_estimators=500,learning_rate=0.1,max_depth=3)
mse=-cross_val_score(GBR,train,log_y,cv=50,scoring='neg_mean_squared_error')
rmse=np.sqrt(mse)
average_rmse=np.mean(rmse)
std=np.std(rmse)
#print(rmse)
print(" The average RMSE is: %f with a standard deviation of %f." %(average_rmse,std))
 The average RMSE is: 0.124749 with a standard deviation of 0.040219.
Let's look at XGBoost.
In [78]:
parameters=[{'booster':['gbtree','dart'],'n_estimators':[100,300],'gamma':[0,0.5,1],'max_depth':[3,7,10],\
             'reg_lambda':[0.1,1],'reg_alpha':[0.1,1]}]

XGBR=XGBRegressor()
grid_search= GridSearchCV(estimator=XGBR,param_grid=parameters,scoring='neg_mean_squared_error',cv=50,n_jobs=-1,verbose=1)
grid_search=grid_search.fit(train,log_y)
print("The best score is %f." %(grid_search.best_score_))
print("The best parameters are ",grid_search.best_params_ )
grid_search.grid_scores_
Fitting 50 folds for each of 144 candidates, totalling 7200 fits
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:   16.9s
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done 1186 tasks      | elapsed:  6.2min
[Parallel(n_jobs=-1)]: Done 1736 tasks      | elapsed:  8.7min
[Parallel(n_jobs=-1)]: Done 2386 tasks      | elapsed: 15.4min
[Parallel(n_jobs=-1)]: Done 3136 tasks      | elapsed: 19.1min
[Parallel(n_jobs=-1)]: Done 3986 tasks      | elapsed: 27.4min
[Parallel(n_jobs=-1)]: Done 4936 tasks      | elapsed: 43.8min
[Parallel(n_jobs=-1)]: Done 5986 tasks      | elapsed: 58.5min
[Parallel(n_jobs=-1)]: Done 7136 tasks      | elapsed: 73.3min
[Parallel(n_jobs=-1)]: Done 7200 out of 7200 | elapsed: 74.6min finished
The best score is -0.016699.
The best parameters are  {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0}
/usr/local/lib/python3.5/dist-packages/sklearn/model_selection/_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)
Out[78]:
[mean: -0.01787, std: 0.01214, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01793, std: 0.01138, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01822, std: 0.01056, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01862, std: 0.01114, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01720, std: 0.01209, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01704, std: 0.01135, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01671, std: 0.01050, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01695, std: 0.01072, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01830, std: 0.01195, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01818, std: 0.01135, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01820, std: 0.01003, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01886, std: 0.01067, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01825, std: 0.01202, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01799, std: 0.01128, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01782, std: 0.00992, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01855, std: 0.01053, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01866, std: 0.01187, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01819, std: 0.01101, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01909, std: 0.01050, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01937, std: 0.01104, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01862, std: 0.01191, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01817, std: 0.01107, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01880, std: 0.01044, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01907, std: 0.01090, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.02142, std: 0.01132, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02192, std: 0.01142, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02228, std: 0.01193, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02222, std: 0.01168, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02142, std: 0.01132, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02192, std: 0.01142, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02228, std: 0.01193, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02221, std: 0.01167, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02133, std: 0.01131, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02183, std: 0.01156, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02232, std: 0.01201, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02225, std: 0.01168, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02133, std: 0.01131, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02183, std: 0.01156, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02232, std: 0.01201, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02225, std: 0.01168, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02126, std: 0.01123, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02183, std: 0.01156, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02232, std: 0.01201, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02225, std: 0.01168, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02126, std: 0.01124, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02183, std: 0.01156, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02232, std: 0.01201, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02225, std: 0.01168, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02400, std: 0.01240, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02415, std: 0.01204, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02473, std: 0.01266, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02485, std: 0.01289, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02400, std: 0.01240, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02415, std: 0.01204, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02473, std: 0.01266, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02485, std: 0.01289, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02400, std: 0.01240, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02411, std: 0.01212, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02475, std: 0.01265, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02485, std: 0.01289, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02400, std: 0.01240, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02411, std: 0.01212, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02475, std: 0.01265, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02485, std: 0.01289, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02400, std: 0.01240, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02411, std: 0.01212, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02475, std: 0.01265, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02485, std: 0.01289, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02400, std: 0.01240, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02411, std: 0.01212, params: {'booster': 'gbtree', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02475, std: 0.01265, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02485, std: 0.01289, params: {'booster': 'gbtree', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.01782, std: 0.01214, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01793, std: 0.01138, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01822, std: 0.01056, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01862, std: 0.01114, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01711, std: 0.01205, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01703, std: 0.01122, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01670, std: 0.01052, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01697, std: 0.01070, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01824, std: 0.01195, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01815, std: 0.01132, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01819, std: 0.01002, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01888, std: 0.01066, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01812, std: 0.01204, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01794, std: 0.01127, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01781, std: 0.00995, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01859, std: 0.01052, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01868, std: 0.01188, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01822, std: 0.01109, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01903, std: 0.01041, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01939, std: 0.01104, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0},
 mean: -0.01863, std: 0.01189, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01819, std: 0.01114, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01876, std: 0.01040, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.01907, std: 0.01091, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0},
 mean: -0.02142, std: 0.01132, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02192, std: 0.01142, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02228, std: 0.01193, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02222, std: 0.01168, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02142, std: 0.01132, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02192, std: 0.01142, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02228, std: 0.01193, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02221, std: 0.01167, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02133, std: 0.01131, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02183, std: 0.01156, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02232, std: 0.01201, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02225, std: 0.01168, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02133, std: 0.01131, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02183, std: 0.01156, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02232, std: 0.01201, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02225, std: 0.01168, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02126, std: 0.01123, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02183, std: 0.01156, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02232, std: 0.01201, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02225, std: 0.01168, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 0.5},
 mean: -0.02126, std: 0.01124, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02183, std: 0.01156, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02232, std: 0.01201, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02225, std: 0.01168, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 0.5},
 mean: -0.02400, std: 0.01240, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02415, std: 0.01204, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02473, std: 0.01266, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02485, std: 0.01289, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02400, std: 0.01240, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02415, std: 0.01204, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02473, std: 0.01266, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02485, std: 0.01289, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 3, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02400, std: 0.01240, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02411, std: 0.01212, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02475, std: 0.01265, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02485, std: 0.01289, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02400, std: 0.01240, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02411, std: 0.01212, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02475, std: 0.01265, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02485, std: 0.01289, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 7, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02400, std: 0.01240, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02411, std: 0.01212, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02475, std: 0.01265, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02485, std: 0.01289, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 100, 'gamma': 1},
 mean: -0.02400, std: 0.01240, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02411, std: 0.01212, params: {'booster': 'dart', 'reg_alpha': 0.1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02475, std: 0.01265, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 0.1, 'n_estimators': 300, 'gamma': 1},
 mean: -0.02485, std: 0.01289, params: {'booster': 'dart', 'reg_alpha': 1, 'max_depth': 10, 'reg_lambda': 1, 'n_estimators': 300, 'gamma': 1}]
In [312]:
XGBR=XGBRegressor(booster='dart',reg_alpha=1,max_depth=3,reg_lambda=0.1,n_estimators=300,gamma=0)
mse=-cross_val_score(XGBR,train,log_y,cv=50,scoring='neg_mean_squared_error')
rmse=np.sqrt(mse)
average_rmse=np.mean(rmse)
std=np.std(rmse)
#print(rmse)
print(" The average RMSE is: %f with a standard deviation of %f." %(average_rmse,std))
 The average RMSE is: 0.125201 with a standard deviation of 0.035950.
We see that we get slightly better results/
Let us now try building a better regressor by stacking several regressors together.
We write a simple class that takes some models and returns the average of their predicitons.
In [313]:
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
In [129]:
#https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard

class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)

        return self
    
    #Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)   
In [130]:
#average
averaged_models = AveragingModels(models = (b, R, GBR, XGBR))

mse=-cross_val_score(averaged_models,train,log_y,cv=50,scoring='neg_mean_squared_error')
rmse=np.sqrt(mse)
average_rmse=np.mean(rmse)
std=np.std(rmse)
#print(rmse)
print(" The average RMSE is: %f with a standard deviation of %f." %(average_rmse,std))
 The average RMSE is: 0.113766 with a standard deviation of 0.036917.
Now I am going to weight the models arbitrarly and see if we get better results.
In [132]:
#https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard

class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)

        return self
    
    #Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        
        w=np.array([0.3,0.3,0.2,0.2])
        
        return np.dot(predictions, w.T)   
In [133]:
#average
averaged_models = AveragingModels(models = (b, R, GBR, XGBR))

mse=-cross_val_score(averaged_models,train,log_y,cv=50,scoring='neg_mean_squared_error')
rmse=np.sqrt(mse)
average_rmse=np.mean(rmse)
std=np.std(rmse)
#print(rmse)
print(" The average RMSE is: %f with a standard deviation of %f." %(average_rmse,std))
 The average RMSE is: 0.113737 with a standard deviation of 0.036959.
let's try other values for the weights.
In [134]:
#https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard

class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)

        return self
    
    #Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        
        w=np.array([0.4,0.3,0.2,0.1])
        
        return np.dot(predictions, w.T)   
In [135]:
#average
averaged_models = AveragingModels(models = (b, R, GBR, XGBR))

mse=-cross_val_score(averaged_models,train,log_y,cv=50,scoring='neg_mean_squared_error')
rmse=np.sqrt(mse)
average_rmse=np.mean(rmse)
std=np.std(rmse)
#print(rmse)
print(" The average RMSE is: %f with a standard deviation of %f." %(average_rmse,std))
 The average RMSE is: 0.113836 with a standard deviation of 0.036795.
Let's try one more.
In [136]:
# credits: https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard

class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)

        return self
    
    #Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        
        w=np.array([0.5,0.2,0.1,0.1])
        
        return np.dot(predictions, w.T)   
In [137]:
#average
averaged_models = AveragingModels(models = (b, R, GBR, XGBR))

mse=-cross_val_score(averaged_models,train,log_y,cv=50,scoring='neg_mean_squared_error')
rmse=np.sqrt(mse)
average_rmse=np.mean(rmse)
std=np.std(rmse)
#print(rmse)
print(" The average RMSE is: %f with a standard deviation of %f." %(average_rmse,std))
 The average RMSE is: 1.209081 with a standard deviation of 0.024228.
Let's try a Neural Network using Keras.
In [106]:
!pip install keras
Collecting keras
  Downloading https://files.pythonhosted.org/packages/54/e8/eaff7a09349ae9bd40d3ebaf028b49f5e2392c771f294910f75bb608b241/Keras-2.1.6-py2.py3-none-any.whl (339kB)
    100% |################################| 348kB 1.6MB/s ta 0:00:01
Requirement already satisfied: h5py in /usr/local/lib/python3.5/dist-packages (from keras)
Collecting pyyaml (from keras)
  Downloading https://files.pythonhosted.org/packages/4a/85/db5a2df477072b2902b0eb892feb37d88ac635d36245a72a6a69b23b383a/PyYAML-3.12.tar.gz (253kB)
    100% |################################| 256kB 2.0MB/s ta 0:00:01
Requirement already satisfied: numpy>=1.9.1 in /usr/local/lib/python3.5/dist-packages (from keras)
Requirement already satisfied: scipy>=0.14 in /usr/local/lib/python3.5/dist-packages (from keras)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.5/dist-packages (from keras)
Building wheels for collected packages: pyyaml
  Running setup.py bdist_wheel for pyyaml ... done
  Stored in directory: /root/.cache/pip/wheels/03/05/65/bdc14f2c6e09e82ae3e0f13d021e1b6b2481437ea2f207df3f
Successfully built pyyaml
Installing collected packages: pyyaml, keras
Successfully installed keras-2.1.6 pyyaml-3.12
You are using pip version 9.0.1, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
In [113]:
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
In [119]:
def baseline_model():
    model = Sequential()
    model.add(Dense(130, input_dim=130,kernel_initializer='normal', activation='relu'))
    model.add(Dense(80, activation='relu',kernel_initializer='normal'))
    model.add(Dense(40, activation='relu',kernel_initializer='normal'))
    model.add(Dense(20, activation='relu',kernel_initializer='normal'))
    model.add(Dense(10, activation='relu',kernel_initializer='normal'))
    model.add(Dense(1,kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
In [120]:
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# evaluate model
estimator = KerasRegressor(build_fn=baseline_model, epochs=100, batch_size=5, verbose=0)
In [123]:
kfold = KFold(n_splits=50, random_state=seed)
results = cross_val_score(estimator, train, log_y, cv=kfold,verbose=1)
print("Results: %.2f (%.2f) MSE" % (results.mean(), results.std()))
Results: -0.17 (0.09) MSE
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed: 54.5min finished
The best model seems to be the stacked one.
In [170]:
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)

        return self
    
    #Now we do the predictions for cloned models and average them
    def predict(self, X):
        #self.models_ = [clone(x) for x in self.models]
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        
        w=np.array([0.3,0.3,0.2,0.2])
        
        return np.dot(predictions, w.T)   
    
    
In [171]:
#average
averaged_models = AveragingModels(models = (b, R, GBR, XGBR))
mse=-cross_val_score(averaged_models,train,log_y,cv=50,scoring='neg_mean_squared_error')
rmse=np.sqrt(mse)
average_rmse=np.mean(rmse)
std=np.std(rmse)
#print(rmse)
print(" The average RMSE is: %f with a standard deviation of %f." %(average_rmse,std))
 The average RMSE is: 0.113683 with a standard deviation of 0.037014.
In [139]:
test=full_imp_skewed_scaled.iloc[index:,:]
In [145]:
test.shape
Out[145]:
(260, 130)
In [172]:
averaged_models.fit(train,log_y)
y_pred=averaged_models.predict(test)
In [175]:
y_pred=np.exp(y_pred)
In [176]:
submissions=pd.DataFrame({'Id':test_ID,'SalePrice':y_pred})
submissions.to_csv('submission.csv',index=False)
Before we conclude,we want to test another neural network. The following neural network is just for comparison. Credits for the code go entirely to Johnny Liu. He uses a package called Gluon from the MXNet library.
In [178]:
!pip install mxnet
Collecting mxnet
  Downloading https://files.pythonhosted.org/packages/96/98/c9877e100c3d1ac92263bfaba7bb8a49294e099046592040a2ff8620ac61/mxnet-1.1.0.post0-py2.py3-none-manylinux1_x86_64.whl (23.8MB)
    100% |████████████████████████████████| 23.8MB 58kB/s  eta 0:00:01
Requirement already satisfied: numpy<1.15.0,>=1.8.2 in /opt/conda/lib/python3.6/site-packages (from mxnet)
Collecting graphviz<0.9.0,>=0.8.1 (from mxnet)
  Downloading https://files.pythonhosted.org/packages/84/44/21a7fdd50841aaaef224b943f7d10df87e476e181bb926ccf859bcb53d48/graphviz-0.8.3-py2.py3-none-any.whl
Requirement already satisfied: requests<2.19.0,>=2.18.4 in /opt/conda/lib/python3.6/site-packages (from mxnet)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /opt/conda/lib/python3.6/site-packages (from requests<2.19.0,>=2.18.4->mxnet)
Requirement already satisfied: idna<2.7,>=2.5 in /opt/conda/lib/python3.6/site-packages (from requests<2.19.0,>=2.18.4->mxnet)
Requirement already satisfied: urllib3<1.23,>=1.21.1 in /opt/conda/lib/python3.6/site-packages (from requests<2.19.0,>=2.18.4->mxnet)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.6/site-packages (from requests<2.19.0,>=2.18.4->mxnet)
Installing collected packages: graphviz, mxnet
Successfully installed graphviz-0.8.3 mxnet-1.1.0.post0
You are using pip version 9.0.1, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
In [182]:
import mxnet as mx
from mxnet import gluon
from scipy.stats import skew
from scipy.stats.stats import pearsonr
from mxnet import ndarray as nd
from mxnet import autograd
from mxnet import gluon
In [191]:
X_train,X_test=import_data()
In [192]:
all_X = pd.concat((X_train.loc[:, 'MSSubClass':'SaleCondition'],
                      X_test.loc[:, 'MSSubClass':'SaleCondition']))
In [193]:
numnic_feats = all_X.dtypes[all_X.dtypes!='object'].index
all_X[numnic_feats] = all_X[numnic_feats].apply(lambda x: (x-x.mean())/x.std())
In [194]:
all_X = pd.get_dummies(all_X, dummy_na=True)
In [195]:
all_X = all_X.fillna(all_X.mean())
In [198]:
num_train = X_train.shape[0]
X_train = all_X[:num_train].as_matrix()
X_test = all_X[num_train:].as_matrix()
y_train = y.as_matrix()
In [199]:
from mxnet import ndarray as nd
from mxnet import autograd
from mxnet import gluon

X_train = nd.array(X_train)
y_train = nd.array(y_train)
y_train.reshape((num_train, 1))

X_test = nd.array(X_test)
In [200]:
square_loss = gluon.loss.L2Loss()


def get_rmse_log(net, X_train,y_train):
    num_train = X_train.shape[0]
    clipped_preds = nd.clip(net(X_train), 1, float('inf'))
    return nd.sqrt(2*nd.sum(square_loss(nd.log(clipped_preds),nd.log(y_train))/num_train)).asscalar()    
In [201]:
def get_net():
    net = gluon.nn.Sequential()
    with net.name_scope():
        net.add(gluon.nn.Dense(128, activation='relu'))
        net.add(gluon.nn.Dropout(0.01))
#         net.add(gluon.nn.Dense(32, activation='relu'))
#         net.add(gluon.nn.Dropout(0.2))
        net.add(gluon.nn.Dense(1))
    net.initialize()
    return net
In [202]:
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import matplotlib.pyplot as plt

def train(net, X_train, y_train, X_test, y_test, epochs,
          verbose_epoch, learning_rate, weight_decay):
    train_loss = []
    if X_test is not None:
        test_loss = []
    batch_size = 100
    dataset_train = gluon.data.ArrayDataset(X_train, y_train)
    data_iter_train = gluon.data.DataLoader(
        dataset_train, batch_size,shuffle=True)
    trainer = gluon.Trainer(net.collect_params(), 'adam',
                            {'learning_rate': learning_rate,
                             'wd': weight_decay})
    net.collect_params().initialize(force_reinit=True)
    for epoch in range(epochs):
        for data, label in data_iter_train:
            with autograd.record():
                output = net(data)
                loss = square_loss(output, label)
            loss.backward()
            trainer.step(batch_size)

            cur_train_loss = get_rmse_log(net, X_train, y_train)
        if epoch > verbose_epoch:
            print("Epoch %d, train loss: %f" % (epoch, cur_train_loss))
        train_loss.append(cur_train_loss)
        if X_test is not None:
            cur_test_loss = get_rmse_log(net, X_test, y_test)
            test_loss.append(cur_test_loss)
    plt.plot(train_loss)
    plt.legend(['train'])
    if X_test is not None:
        plt.plot(test_loss)
        plt.legend(['train','test'])
    plt.show()
    if X_test is not None:
        return cur_train_loss, cur_test_loss
    else:
        return cur_train_loss
In [203]:
def k_fold_cross_valid(k, epochs, verbose_epoch, X_train, y_train,
                       learning_rate, weight_decay):
    assert k > 1
    fold_size = X_train.shape[0] // k
    train_loss_sum = 0.0
    test_loss_sum = 0.0
    for test_i in range(k):
        X_val_test = X_train[test_i * fold_size: (test_i + 1) * fold_size, :]
        y_val_test = y_train[test_i * fold_size: (test_i + 1) * fold_size]

        val_train_defined = False
        for i in range(k):
            if i != test_i:
                X_cur_fold = X_train[i * fold_size: (i + 1) * fold_size, :]
                y_cur_fold = y_train[i * fold_size: (i + 1) * fold_size]
                if not val_train_defined:
                    X_val_train = X_cur_fold
                    y_val_train = y_cur_fold
                    val_train_defined = True
                else:
                    X_val_train = nd.concat(X_val_train, X_cur_fold, dim=0)
                    y_val_train = nd.concat(y_val_train, y_cur_fold, dim=0)
        net = get_net()
        train_loss, test_loss = train(
            net, X_val_train, y_val_train, X_val_test, y_val_test,epochs, verbose_epoch, learning_rate, weight_decay)
        train_loss_sum += train_loss
        print("Test loss: %f" % test_loss)
        test_loss_sum += test_loss
    return train_loss_sum / k, test_loss_sum / k
In [207]:
k = 50
epochs = 100#80#100#80#100
verbose_epoch = 35#75#95
learning_rate = 0.01 #0.001 0.24
weight_decay = 130 #200#80 #100.0 #20
train_loss, test_loss = k_fold_cross_valid(k, epochs, verbose_epoch, X_train,
                                           y_train, learning_rate, weight_decay)
print("%d-fold validation: Avg train loss: %f, Avg test loss: %f" %
      (k, train_loss, test_loss))
Epoch 36, train loss: 0.358370
Epoch 37, train loss: 0.329547
Epoch 38, train loss: 0.304156
Epoch 39, train loss: 0.281300
Epoch 40, train loss: 0.260853
Epoch 41, train loss: 0.243371
Epoch 42, train loss: 0.228948
Epoch 43, train loss: 0.216911
Epoch 44, train loss: 0.206755
Epoch 45, train loss: 0.198232
Epoch 46, train loss: 0.191778
Epoch 47, train loss: 0.186597
Epoch 48, train loss: 0.182579
Epoch 49, train loss: 0.179636
Epoch 50, train loss: 0.177325
Epoch 51, train loss: 0.175631
Epoch 52, train loss: 0.174307
Epoch 53, train loss: 0.173259
Epoch 54, train loss: 0.172454
Epoch 55, train loss: 0.171768
Epoch 56, train loss: 0.171191
Epoch 57, train loss: 0.170652
Epoch 58, train loss: 0.170196
Epoch 59, train loss: 0.169767
Epoch 60, train loss: 0.169482
Epoch 61, train loss: 0.169156
Epoch 62, train loss: 0.168901
Epoch 63, train loss: 0.168714
Epoch 64, train loss: 0.168626
Epoch 65, train loss: 0.168415
Epoch 66, train loss: 0.168280
Epoch 67, train loss: 0.168266
Epoch 68, train loss: 0.168301
Epoch 69, train loss: 0.168384
Epoch 70, train loss: 0.168341
Epoch 71, train loss: 0.168410
Epoch 72, train loss: 0.168418
Epoch 73, train loss: 0.168599
Epoch 74, train loss: 0.168507
Epoch 75, train loss: 0.168739
Epoch 76, train loss: 0.168638
Epoch 77, train loss: 0.168682
Epoch 78, train loss: 0.168664
Epoch 79, train loss: 0.168227
Epoch 80, train loss: 0.168469
Epoch 81, train loss: 0.168285
Epoch 82, train loss: 0.168195
Epoch 83, train loss: 0.168076
Epoch 84, train loss: 0.167744
Epoch 85, train loss: 0.167307
Epoch 86, train loss: 0.167071
Epoch 87, train loss: 0.166384
Epoch 88, train loss: 0.166117
Epoch 89, train loss: 0.165722
Epoch 90, train loss: 0.165048
Epoch 91, train loss: 0.164526
Epoch 92, train loss: 0.163991
Epoch 93, train loss: 0.163025
Epoch 94, train loss: 0.162141
Epoch 95, train loss: 0.161371
Epoch 96, train loss: 0.160612
Epoch 97, train loss: 0.159548
Epoch 98, train loss: 0.158722
Epoch 99, train loss: 0.157808
Test loss: 0.140869
Epoch 36, train loss: 0.264208
Epoch 37, train loss: 0.244609
Epoch 38, train loss: 0.227945
Epoch 39, train loss: 0.214140
Epoch 40, train loss: 0.202930
Epoch 41, train loss: 0.194328
Epoch 42, train loss: 0.187346
Epoch 43, train loss: 0.182347
Epoch 44, train loss: 0.178700
Epoch 45, train loss: 0.176210
Epoch 46, train loss: 0.174308
Epoch 47, train loss: 0.172905
Epoch 48, train loss: 0.171876
Epoch 49, train loss: 0.170997
Epoch 50, train loss: 0.170306
Epoch 51, train loss: 0.169683
Epoch 52, train loss: 0.169178
Epoch 53, train loss: 0.168726
Epoch 54, train loss: 0.168284
Epoch 55, train loss: 0.167926
Epoch 56, train loss: 0.167705
Epoch 57, train loss: 0.167476
Epoch 58, train loss: 0.167378
Epoch 59, train loss: 0.167134
Epoch 60, train loss: 0.167086
Epoch 61, train loss: 0.167210
Epoch 62, train loss: 0.167235
Epoch 63, train loss: 0.167062
Epoch 64, train loss: 0.167313
Epoch 65, train loss: 0.167584
Epoch 66, train loss: 0.167431
Epoch 67, train loss: 0.167629
Epoch 68, train loss: 0.167705
Epoch 69, train loss: 0.167675
Epoch 70, train loss: 0.167822
Epoch 71, train loss: 0.167937
Epoch 72, train loss: 0.167755
Epoch 73, train loss: 0.167722
Epoch 74, train loss: 0.167583
Epoch 75, train loss: 0.167337
Epoch 76, train loss: 0.166973
Epoch 77, train loss: 0.166700
Epoch 78, train loss: 0.166719
Epoch 79, train loss: 0.166273
Epoch 80, train loss: 0.165787
Epoch 81, train loss: 0.165396
Epoch 82, train loss: 0.164940
Epoch 83, train loss: 0.164476
Epoch 84, train loss: 0.163614
Epoch 85, train loss: 0.162581
Epoch 86, train loss: 0.161665
Epoch 87, train loss: 0.160658
Epoch 88, train loss: 0.160159
Epoch 89, train loss: 0.158706
Epoch 90, train loss: 0.157462
Epoch 91, train loss: 0.156222
Epoch 92, train loss: 0.154904
Epoch 93, train loss: 0.153871
Epoch 94, train loss: 0.152790
Epoch 95, train loss: 0.151638
Epoch 96, train loss: 0.150837
Epoch 97, train loss: 0.149618
Epoch 98, train loss: 0.148630
Epoch 99, train loss: 0.147771
Test loss: 0.190945
Epoch 36, train loss: 0.267124
Epoch 37, train loss: 0.246855
Epoch 38, train loss: 0.229879
Epoch 39, train loss: 0.216016
Epoch 40, train loss: 0.204692
Epoch 41, train loss: 0.195842
Epoch 42, train loss: 0.189142
Epoch 43, train loss: 0.184189
Epoch 44, train loss: 0.180288
Epoch 45, train loss: 0.177445
Epoch 46, train loss: 0.175363
Epoch 47, train loss: 0.173874
Epoch 48, train loss: 0.172679
Epoch 49, train loss: 0.171772
Epoch 50, train loss: 0.171068
Epoch 51, train loss: 0.170505
Epoch 52, train loss: 0.170004
Epoch 53, train loss: 0.169629
Epoch 54, train loss: 0.169343
Epoch 55, train loss: 0.169045
Epoch 56, train loss: 0.168902
Epoch 57, train loss: 0.168848
Epoch 58, train loss: 0.168937
Epoch 59, train loss: 0.168878
Epoch 60, train loss: 0.169028
Epoch 61, train loss: 0.169277
Epoch 62, train loss: 0.169507
Epoch 63, train loss: 0.169763
Epoch 64, train loss: 0.170024
Epoch 65, train loss: 0.170285
Epoch 66, train loss: 0.170609
Epoch 67, train loss: 0.171271
Epoch 68, train loss: 0.171523
Epoch 69, train loss: 0.171807
Epoch 70, train loss: 0.172101
Epoch 71, train loss: 0.172324
Epoch 72, train loss: 0.172671
Epoch 73, train loss: 0.172784
Epoch 74, train loss: 0.173078
Epoch 75, train loss: 0.173085
Epoch 76, train loss: 0.173282
Epoch 77, train loss: 0.173501
Epoch 78, train loss: 0.173401
Epoch 79, train loss: 0.173387
Epoch 80, train loss: 0.173016
Epoch 81, train loss: 0.172710
Epoch 82, train loss: 0.172166
Epoch 83, train loss: 0.172064
Epoch 84, train loss: 0.171664
Epoch 85, train loss: 0.170945
Epoch 86, train loss: 0.170266
Epoch 87, train loss: 0.169716
Epoch 88, train loss: 0.169120
Epoch 89, train loss: 0.168491
Epoch 90, train loss: 0.167719
Epoch 91, train loss: 0.167067
Epoch 92, train loss: 0.166179
Epoch 93, train loss: 0.165225
Epoch 94, train loss: 0.164257
Epoch 95, train loss: 0.163442
Epoch 96, train loss: 0.162457
Epoch 97, train loss: 0.161400
Epoch 98, train loss: 0.160597
Epoch 99, train loss: 0.159378
Test loss: 0.190603
Epoch 36, train loss: 0.273645
Epoch 37, train loss: 0.252416
Epoch 38, train loss: 0.234936
Epoch 39, train loss: 0.220489
Epoch 40, train loss: 0.208710
Epoch 41, train loss: 0.199074
Epoch 42, train loss: 0.192034
Epoch 43, train loss: 0.186522
Epoch 44, train loss: 0.182302
Epoch 45, train loss: 0.179208
Epoch 46, train loss: 0.176844
Epoch 47, train loss: 0.175163
Epoch 48, train loss: 0.173855
Epoch 49, train loss: 0.172844
Epoch 50, train loss: 0.172074
Epoch 51, train loss: 0.171444
Epoch 52, train loss: 0.170913
Epoch 53, train loss: 0.170477
Epoch 54, train loss: 0.170108
Epoch 55, train loss: 0.169785
Epoch 56, train loss: 0.169509
Epoch 57, train loss: 0.169336
Epoch 58, train loss: 0.169193
Epoch 59, train loss: 0.169254
Epoch 60, train loss: 0.169178
Epoch 61, train loss: 0.169215
Epoch 62, train loss: 0.169297
Epoch 63, train loss: 0.169325
Epoch 64, train loss: 0.169711
Epoch 65, train loss: 0.169907
Epoch 66, train loss: 0.170293
Epoch 67, train loss: 0.170751
Epoch 68, train loss: 0.170831
Epoch 69, train loss: 0.171007
Epoch 70, train loss: 0.171588
Epoch 71, train loss: 0.171934
Epoch 72, train loss: 0.171955
Epoch 73, train loss: 0.172107
Epoch 74, train loss: 0.172334
Epoch 75, train loss: 0.172215
Epoch 76, train loss: 0.172334
Epoch 77, train loss: 0.171943
Epoch 78, train loss: 0.171994
Epoch 79, train loss: 0.171767
Epoch 80, train loss: 0.171410
Epoch 81, train loss: 0.170318
Epoch 82, train loss: 0.169657
Epoch 83, train loss: 0.169487
Epoch 84, train loss: 0.168771
Epoch 85, train loss: 0.168205
Epoch 86, train loss: 0.167218
Epoch 87, train loss: 0.166725
Epoch 88, train loss: 0.166045
Epoch 89, train loss: 0.164897
Epoch 90, train loss: 0.164499
Epoch 91, train loss: 0.163571
Epoch 92, train loss: 0.162622
Epoch 93, train loss: 0.161339
Epoch 94, train loss: 0.160432
Epoch 95, train loss: 0.159999
Epoch 96, train loss: 0.159022
Epoch 97, train loss: 0.158091
Epoch 98, train loss: 0.157329
Epoch 99, train loss: 0.156679
Test loss: 0.173561
Epoch 36, train loss: 0.265279
Epoch 37, train loss: 0.245101
Epoch 38, train loss: 0.228695
Epoch 39, train loss: 0.215127
Epoch 40, train loss: 0.204086
Epoch 41, train loss: 0.195763
Epoch 42, train loss: 0.189128
Epoch 43, train loss: 0.184137
Epoch 44, train loss: 0.180524
Epoch 45, train loss: 0.177690
Epoch 46, train loss: 0.175740
Epoch 47, train loss: 0.174208
Epoch 48, train loss: 0.173083
Epoch 49, train loss: 0.172179
Epoch 50, train loss: 0.171495
Epoch 51, train loss: 0.170944
Epoch 52, train loss: 0.170377
Epoch 53, train loss: 0.170011
Epoch 54, train loss: 0.169665
Epoch 55, train loss: 0.169378
Epoch 56, train loss: 0.169133
Epoch 57, train loss: 0.169036
Epoch 58, train loss: 0.168782
Epoch 59, train loss: 0.168709
Epoch 60, train loss: 0.168785
Epoch 61, train loss: 0.168810
Epoch 62, train loss: 0.168805
Epoch 63, train loss: 0.168835
Epoch 64, train loss: 0.168878
Epoch 65, train loss: 0.169193
Epoch 66, train loss: 0.169227
Epoch 67, train loss: 0.169245
Epoch 68, train loss: 0.169561
Epoch 69, train loss: 0.169529
Epoch 70, train loss: 0.169516
Epoch 71, train loss: 0.169371
Epoch 72, train loss: 0.169330
Epoch 73, train loss: 0.169298
Epoch 74, train loss: 0.169057
Epoch 75, train loss: 0.168943
Epoch 76, train loss: 0.168509
Epoch 77, train loss: 0.168267
Epoch 78, train loss: 0.167600
Epoch 79, train loss: 0.167156
Epoch 80, train loss: 0.166645
Epoch 81, train loss: 0.165861
Epoch 82, train loss: 0.165372
Epoch 83, train loss: 0.164906
Epoch 84, train loss: 0.164218
Epoch 85, train loss: 0.163475
Epoch 86, train loss: 0.162663
Epoch 87, train loss: 0.162144
Epoch 88, train loss: 0.161552
Epoch 89, train loss: 0.160771
Epoch 90, train loss: 0.159926
Epoch 91, train loss: 0.159248
Epoch 92, train loss: 0.158681
Epoch 93, train loss: 0.157823
Epoch 94, train loss: 0.156875
Epoch 95, train loss: 0.156089
Epoch 96, train loss: 0.155575
Epoch 97, train loss: 0.154678
Epoch 98, train loss: 0.153843
Epoch 99, train loss: 0.153167
Test loss: 0.148445
Epoch 36, train loss: 0.254571
Epoch 37, train loss: 0.236752
Epoch 38, train loss: 0.221799
Epoch 39, train loss: 0.209882
Epoch 40, train loss: 0.200598
Epoch 41, train loss: 0.193127
Epoch 42, train loss: 0.187378
Epoch 43, train loss: 0.183208
Epoch 44, train loss: 0.180053
Epoch 45, train loss: 0.177665
Epoch 46, train loss: 0.175897
Epoch 47, train loss: 0.174478
Epoch 48, train loss: 0.173434
Epoch 49, train loss: 0.172600
Epoch 50, train loss: 0.171897
Epoch 51, train loss: 0.171342
Epoch 52, train loss: 0.170879
Epoch 53, train loss: 0.170467
Epoch 54, train loss: 0.170152
Epoch 55, train loss: 0.169941
Epoch 56, train loss: 0.169754
Epoch 57, train loss: 0.169635
Epoch 58, train loss: 0.169607
Epoch 59, train loss: 0.169664
Epoch 60, train loss: 0.169815
Epoch 61, train loss: 0.169781
Epoch 62, train loss: 0.169928
Epoch 63, train loss: 0.170118
Epoch 64, train loss: 0.170463
Epoch 65, train loss: 0.170681
Epoch 66, train loss: 0.171009
Epoch 67, train loss: 0.171263
Epoch 68, train loss: 0.171901
Epoch 69, train loss: 0.172124
Epoch 70, train loss: 0.172274
Epoch 71, train loss: 0.172448
Epoch 72, train loss: 0.172890
Epoch 73, train loss: 0.172989
Epoch 74, train loss: 0.173297
Epoch 75, train loss: 0.173366
Epoch 76, train loss: 0.173207
Epoch 77, train loss: 0.173098
Epoch 78, train loss: 0.172752
Epoch 79, train loss: 0.172677
Epoch 80, train loss: 0.172595
Epoch 81, train loss: 0.172219
Epoch 82, train loss: 0.171680
Epoch 83, train loss: 0.171281
Epoch 84, train loss: 0.170442
Epoch 85, train loss: 0.169488
Epoch 86, train loss: 0.169250
Epoch 87, train loss: 0.168372
Epoch 88, train loss: 0.167834
Epoch 89, train loss: 0.166859
Epoch 90, train loss: 0.165998
Epoch 91, train loss: 0.164927
Epoch 92, train loss: 0.164132
Epoch 93, train loss: 0.163411
Epoch 94, train loss: 0.162900
Epoch 95, train loss: 0.161870
Epoch 96, train loss: 0.161251
Epoch 97, train loss: 0.160105
Epoch 98, train loss: 0.159418
Epoch 99, train loss: 0.158560
Test loss: 0.126427
Epoch 36, train loss: 0.292918
Epoch 37, train loss: 0.270132
Epoch 38, train loss: 0.250267
Epoch 39, train loss: 0.233007
Epoch 40, train loss: 0.219400
Epoch 41, train loss: 0.208049
Epoch 42, train loss: 0.198942
Epoch 43, train loss: 0.191799
Epoch 44, train loss: 0.186212
Epoch 45, train loss: 0.182173
Epoch 46, train loss: 0.179162
Epoch 47, train loss: 0.176740
Epoch 48, train loss: 0.175066
Epoch 49, train loss: 0.173699
Epoch 50, train loss: 0.172633
Epoch 51, train loss: 0.171805
Epoch 52, train loss: 0.171105
Epoch 53, train loss: 0.170566
Epoch 54, train loss: 0.170030
Epoch 55, train loss: 0.169626
Epoch 56, train loss: 0.169295
Epoch 57, train loss: 0.168898
Epoch 58, train loss: 0.168711
Epoch 59, train loss: 0.168451
Epoch 60, train loss: 0.168304
Epoch 61, train loss: 0.168354
Epoch 62, train loss: 0.168192
Epoch 63, train loss: 0.168259
Epoch 64, train loss: 0.168510
Epoch 65, train loss: 0.168605
Epoch 66, train loss: 0.168641
Epoch 67, train loss: 0.168810
Epoch 68, train loss: 0.168970
Epoch 69, train loss: 0.169289
Epoch 70, train loss: 0.169463
Epoch 71, train loss: 0.169833
Epoch 72, train loss: 0.170005
Epoch 73, train loss: 0.170122
Epoch 74, train loss: 0.170213
Epoch 75, train loss: 0.170155
Epoch 76, train loss: 0.170224
Epoch 77, train loss: 0.170386
Epoch 78, train loss: 0.170118
Epoch 79, train loss: 0.170216
Epoch 80, train loss: 0.169969
Epoch 81, train loss: 0.169738
Epoch 82, train loss: 0.169438
Epoch 83, train loss: 0.168955
Epoch 84, train loss: 0.168751
Epoch 85, train loss: 0.168574
Epoch 86, train loss: 0.167916
Epoch 87, train loss: 0.167562
Epoch 88, train loss: 0.166956
Epoch 89, train loss: 0.166104
Epoch 90, train loss: 0.165800
Epoch 91, train loss: 0.165086
Epoch 92, train loss: 0.164388
Epoch 93, train loss: 0.163435
Epoch 94, train loss: 0.163043
Epoch 95, train loss: 0.162289
Epoch 96, train loss: 0.161542
Epoch 97, train loss: 0.160862
Epoch 98, train loss: 0.159995
Epoch 99, train loss: 0.159472
Test loss: 0.159065
Epoch 36, train loss: 0.292998
Epoch 37, train loss: 0.268990
Epoch 38, train loss: 0.248293
Epoch 39, train loss: 0.231621
Epoch 40, train loss: 0.217530
Epoch 41, train loss: 0.206223
Epoch 42, train loss: 0.197537
Epoch 43, train loss: 0.190819
Epoch 44, train loss: 0.185533
Epoch 45, train loss: 0.181788
Epoch 46, train loss: 0.178885
Epoch 47, train loss: 0.176664
Epoch 48, train loss: 0.175000
Epoch 49, train loss: 0.173708
Epoch 50, train loss: 0.172631
Epoch 51, train loss: 0.171786
Epoch 52, train loss: 0.171071
Epoch 53, train loss: 0.170434
Epoch 54, train loss: 0.169877
Epoch 55, train loss: 0.169386
Epoch 56, train loss: 0.168949
Epoch 57, train loss: 0.168565
Epoch 58, train loss: 0.168150
Epoch 59, train loss: 0.167883
Epoch 60, train loss: 0.167657
Epoch 61, train loss: 0.167299
Epoch 62, train loss: 0.167167
Epoch 63, train loss: 0.167119
Epoch 64, train loss: 0.166881
Epoch 65, train loss: 0.166678
Epoch 66, train loss: 0.166592
Epoch 67, train loss: 0.166494
Epoch 68, train loss: 0.166222
Epoch 69, train loss: 0.166096
Epoch 70, train loss: 0.165934
Epoch 71, train loss: 0.165800
Epoch 72, train loss: 0.165329
Epoch 73, train loss: 0.165014
Epoch 74, train loss: 0.164549
Epoch 75, train loss: 0.164194
Epoch 76, train loss: 0.163766
Epoch 77, train loss: 0.163291
Epoch 78, train loss: 0.162880
Epoch 79, train loss: 0.162220
Epoch 80, train loss: 0.161684
Epoch 81, train loss: 0.161166
Epoch 82, train loss: 0.160534
Epoch 83, train loss: 0.159822
Epoch 84, train loss: 0.159135
Epoch 85, train loss: 0.158559
Epoch 86, train loss: 0.158080
Epoch 87, train loss: 0.157206
Epoch 88, train loss: 0.156699
Epoch 89, train loss: 0.155876
Epoch 90, train loss: 0.155251
Epoch 91, train loss: 0.154578
Epoch 92, train loss: 0.153733
Epoch 93, train loss: 0.152925
Epoch 94, train loss: 0.152265
Epoch 95, train loss: 0.151495
Epoch 96, train loss: 0.150724
Epoch 97, train loss: 0.150214
Epoch 98, train loss: 0.149322
Epoch 99, train loss: 0.148771
Test loss: 0.165271
Epoch 36, train loss: 0.250949
Epoch 37, train loss: 0.233455
Epoch 38, train loss: 0.219188
Epoch 39, train loss: 0.207545
Epoch 40, train loss: 0.198473
Epoch 41, train loss: 0.191321
Epoch 42, train loss: 0.186009
Epoch 43, train loss: 0.182190
Epoch 44, train loss: 0.179151
Epoch 45, train loss: 0.177011
Epoch 46, train loss: 0.175339
Epoch 47, train loss: 0.174100
Epoch 48, train loss: 0.173160
Epoch 49, train loss: 0.172379
Epoch 50, train loss: 0.171742
Epoch 51, train loss: 0.171249
Epoch 52, train loss: 0.170781
Epoch 53, train loss: 0.170417
Epoch 54, train loss: 0.170094
Epoch 55, train loss: 0.169793
Epoch 56, train loss: 0.169663
Epoch 57, train loss: 0.169627
Epoch 58, train loss: 0.169583
Epoch 59, train loss: 0.169443
Epoch 60, train loss: 0.169544
Epoch 61, train loss: 0.169571
Epoch 62, train loss: 0.169689
Epoch 63, train loss: 0.169907
Epoch 64, train loss: 0.170006
Epoch 65, train loss: 0.170192
Epoch 66, train loss: 0.170578
Epoch 67, train loss: 0.170896
Epoch 68, train loss: 0.171122
Epoch 69, train loss: 0.171017
Epoch 70, train loss: 0.171141
Epoch 71, train loss: 0.171435
Epoch 72, train loss: 0.171385
Epoch 73, train loss: 0.171184
Epoch 74, train loss: 0.171251
Epoch 75, train loss: 0.171271
Epoch 76, train loss: 0.171072
Epoch 77, train loss: 0.170670
Epoch 78, train loss: 0.170269
Epoch 79, train loss: 0.170145
Epoch 80, train loss: 0.169753
Epoch 81, train loss: 0.169300
Epoch 82, train loss: 0.169110
Epoch 83, train loss: 0.168266
Epoch 84, train loss: 0.167807
Epoch 85, train loss: 0.167228
Epoch 86, train loss: 0.166356
Epoch 87, train loss: 0.165529
Epoch 88, train loss: 0.164774
Epoch 89, train loss: 0.164271
Epoch 90, train loss: 0.163834
Epoch 91, train loss: 0.162610
Epoch 92, train loss: 0.162001
Epoch 93, train loss: 0.161208
Epoch 94, train loss: 0.160338
Epoch 95, train loss: 0.159803
Epoch 96, train loss: 0.159031
Epoch 97, train loss: 0.158370
Epoch 98, train loss: 0.157515
Epoch 99, train loss: 0.157273
Test loss: 0.118897
Epoch 36, train loss: 0.302814
Epoch 37, train loss: 0.278594
Epoch 38, train loss: 0.257886
Epoch 39, train loss: 0.240270
Epoch 40, train loss: 0.225310
Epoch 41, train loss: 0.212968
Epoch 42, train loss: 0.202917
Epoch 43, train loss: 0.195051
Epoch 44, train loss: 0.188898
Epoch 45, train loss: 0.183969
Epoch 46, train loss: 0.180402
Epoch 47, train loss: 0.177718
Epoch 48, train loss: 0.175778
Epoch 49, train loss: 0.174239
Epoch 50, train loss: 0.173081
Epoch 51, train loss: 0.172178
Epoch 52, train loss: 0.171424
Epoch 53, train loss: 0.170790
Epoch 54, train loss: 0.170317
Epoch 55, train loss: 0.169876
Epoch 56, train loss: 0.169422
Epoch 57, train loss: 0.169102
Epoch 58, train loss: 0.168777
Epoch 59, train loss: 0.168538
Epoch 60, train loss: 0.168261
Epoch 61, train loss: 0.168162
Epoch 62, train loss: 0.168065
Epoch 63, train loss: 0.167951
Epoch 64, train loss: 0.167922
Epoch 65, train loss: 0.167754
Epoch 66, train loss: 0.167708
Epoch 67, train loss: 0.167814
Epoch 68, train loss: 0.167984
Epoch 69, train loss: 0.168065
Epoch 70, train loss: 0.168033
Epoch 71, train loss: 0.168039
Epoch 72, train loss: 0.168029
Epoch 73, train loss: 0.168147
Epoch 74, train loss: 0.168086
Epoch 75, train loss: 0.168157
Epoch 76, train loss: 0.167955
Epoch 77, train loss: 0.167929
Epoch 78, train loss: 0.167759
Epoch 79, train loss: 0.167246
Epoch 80, train loss: 0.167123
Epoch 81, train loss: 0.166591
Epoch 82, train loss: 0.166147
Epoch 83, train loss: 0.165975
Epoch 84, train loss: 0.165286
Epoch 85, train loss: 0.164982
Epoch 86, train loss: 0.164045
Epoch 87, train loss: 0.163376
Epoch 88, train loss: 0.162535
Epoch 89, train loss: 0.162105
Epoch 90, train loss: 0.161528
Epoch 91, train loss: 0.160699
Epoch 92, train loss: 0.159839
Epoch 93, train loss: 0.159214
Epoch 94, train loss: 0.158296
Epoch 95, train loss: 0.157426
Epoch 96, train loss: 0.156954
Epoch 97, train loss: 0.155929
Epoch 98, train loss: 0.155280
Epoch 99, train loss: 0.154294
Test loss: 0.124398
Epoch 36, train loss: 0.306471
Epoch 37, train loss: 0.281687
Epoch 38, train loss: 0.260672
Epoch 39, train loss: 0.242449
Epoch 40, train loss: 0.227389
Epoch 41, train loss: 0.214809
Epoch 42, train loss: 0.204439
Epoch 43, train loss: 0.196433
Epoch 44, train loss: 0.190214
Epoch 45, train loss: 0.185297
Epoch 46, train loss: 0.181550
Epoch 47, train loss: 0.178671
Epoch 48, train loss: 0.176589
Epoch 49, train loss: 0.175013
Epoch 50, train loss: 0.173865
Epoch 51, train loss: 0.172972
Epoch 52, train loss: 0.172245
Epoch 53, train loss: 0.171619
Epoch 54, train loss: 0.171044
Epoch 55, train loss: 0.170550
Epoch 56, train loss: 0.170134
Epoch 57, train loss: 0.169734
Epoch 58, train loss: 0.169505
Epoch 59, train loss: 0.169163
Epoch 60, train loss: 0.168963
Epoch 61, train loss: 0.168853
Epoch 62, train loss: 0.168817
Epoch 63, train loss: 0.168648
Epoch 64, train loss: 0.168422
Epoch 65, train loss: 0.168561
Epoch 66, train loss: 0.168586
Epoch 67, train loss: 0.168639
Epoch 68, train loss: 0.168654
Epoch 69, train loss: 0.168692
Epoch 70, train loss: 0.168897
Epoch 71, train loss: 0.168794
Epoch 72, train loss: 0.168755
Epoch 73, train loss: 0.168584
Epoch 74, train loss: 0.168703
Epoch 75, train loss: 0.168586
Epoch 76, train loss: 0.168588
Epoch 77, train loss: 0.168221
Epoch 78, train loss: 0.168014
Epoch 79, train loss: 0.167890
Epoch 80, train loss: 0.167758
Epoch 81, train loss: 0.167518
Epoch 82, train loss: 0.166842
Epoch 83, train loss: 0.166528
Epoch 84, train loss: 0.166165
Epoch 85, train loss: 0.165406
Epoch 86, train loss: 0.165080
Epoch 87, train loss: 0.164804
Epoch 88, train loss: 0.164144
Epoch 89, train loss: 0.163674
Epoch 90, train loss: 0.163045
Epoch 91, train loss: 0.162175
Epoch 92, train loss: 0.161525
Epoch 93, train loss: 0.161028
Epoch 94, train loss: 0.160163
Epoch 95, train loss: 0.159593
Epoch 96, train loss: 0.159083
Epoch 97, train loss: 0.158203
Epoch 98, train loss: 0.157403
Epoch 99, train loss: 0.156642
Test loss: 0.121114
Epoch 36, train loss: 0.308423
Epoch 37, train loss: 0.284140
Epoch 38, train loss: 0.263059
Epoch 39, train loss: 0.245124
Epoch 40, train loss: 0.229872
Epoch 41, train loss: 0.217319
Epoch 42, train loss: 0.206758
Epoch 43, train loss: 0.198410
Epoch 44, train loss: 0.191980
Epoch 45, train loss: 0.186631
Epoch 46, train loss: 0.182511
Epoch 47, train loss: 0.179368
Epoch 48, train loss: 0.176987
Epoch 49, train loss: 0.175199
Epoch 50, train loss: 0.173848
Epoch 51, train loss: 0.172811
Epoch 52, train loss: 0.172032
Epoch 53, train loss: 0.171355
Epoch 54, train loss: 0.170832
Epoch 55, train loss: 0.170378
Epoch 56, train loss: 0.170020
Epoch 57, train loss: 0.169693
Epoch 58, train loss: 0.169375
Epoch 59, train loss: 0.169208
Epoch 60, train loss: 0.169103
Epoch 61, train loss: 0.169012
Epoch 62, train loss: 0.168962
Epoch 63, train loss: 0.168891
Epoch 64, train loss: 0.168871
Epoch 65, train loss: 0.169002
Epoch 66, train loss: 0.169163
Epoch 67, train loss: 0.169283
Epoch 68, train loss: 0.169498
Epoch 69, train loss: 0.169527
Epoch 70, train loss: 0.169701
Epoch 71, train loss: 0.170019
Epoch 72, train loss: 0.170150
Epoch 73, train loss: 0.170170
Epoch 74, train loss: 0.170301
Epoch 75, train loss: 0.170308
Epoch 76, train loss: 0.170535
Epoch 77, train loss: 0.170489
Epoch 78, train loss: 0.170425
Epoch 79, train loss: 0.170288
Epoch 80, train loss: 0.169948
Epoch 81, train loss: 0.169804
Epoch 82, train loss: 0.169792
Epoch 83, train loss: 0.169636
Epoch 84, train loss: 0.168933
Epoch 85, train loss: 0.168624
Epoch 86, train loss: 0.167848
Epoch 87, train loss: 0.167404
Epoch 88, train loss: 0.166810
Epoch 89, train loss: 0.166319
Epoch 90, train loss: 0.165527
Epoch 91, train loss: 0.164602
Epoch 92, train loss: 0.163753
Epoch 93, train loss: 0.162858
Epoch 94, train loss: 0.162177
Epoch 95, train loss: 0.161249
Epoch 96, train loss: 0.160481
Epoch 97, train loss: 0.159522
Epoch 98, train loss: 0.158401
Epoch 99, train loss: 0.157378
Test loss: 0.138445
Epoch 36, train loss: 0.299495
Epoch 37, train loss: 0.275888
Epoch 38, train loss: 0.255659
Epoch 39, train loss: 0.238315
Epoch 40, train loss: 0.223588
Epoch 41, train loss: 0.211609
Epoch 42, train loss: 0.201818
Epoch 43, train loss: 0.194128
Epoch 44, train loss: 0.188215
Epoch 45, train loss: 0.183857
Epoch 46, train loss: 0.180355
Epoch 47, train loss: 0.177857
Epoch 48, train loss: 0.175968
Epoch 49, train loss: 0.174531
Epoch 50, train loss: 0.173453
Epoch 51, train loss: 0.172616
Epoch 52, train loss: 0.171907
Epoch 53, train loss: 0.171357
Epoch 54, train loss: 0.170891
Epoch 55, train loss: 0.170430
Epoch 56, train loss: 0.170062
Epoch 57, train loss: 0.169771
Epoch 58, train loss: 0.169514
Epoch 59, train loss: 0.169337
Epoch 60, train loss: 0.169264
Epoch 61, train loss: 0.169092
Epoch 62, train loss: 0.168995
Epoch 63, train loss: 0.168935
Epoch 64, train loss: 0.168953
Epoch 65, train loss: 0.169004
Epoch 66, train loss: 0.169117
Epoch 67, train loss: 0.169340
Epoch 68, train loss: 0.169396
Epoch 69, train loss: 0.169450
Epoch 70, train loss: 0.169616
Epoch 71, train loss: 0.169597
Epoch 72, train loss: 0.169579
Epoch 73, train loss: 0.169966
Epoch 74, train loss: 0.169891
Epoch 75, train loss: 0.169876
Epoch 76, train loss: 0.169970
Epoch 77, train loss: 0.169698
Epoch 78, train loss: 0.169788
Epoch 79, train loss: 0.169274
Epoch 80, train loss: 0.169035
Epoch 81, train loss: 0.168719
Epoch 82, train loss: 0.168523
Epoch 83, train loss: 0.167944
Epoch 84, train loss: 0.167465
Epoch 85, train loss: 0.167074
Epoch 86, train loss: 0.166648
Epoch 87, train loss: 0.166155
Epoch 88, train loss: 0.165393
Epoch 89, train loss: 0.164626
Epoch 90, train loss: 0.164032
Epoch 91, train loss: 0.163384
Epoch 92, train loss: 0.162941
Epoch 93, train loss: 0.162148
Epoch 94, train loss: 0.161451
Epoch 95, train loss: 0.160876
Epoch 96, train loss: 0.159974
Epoch 97, train loss: 0.159411
Epoch 98, train loss: 0.158839
Epoch 99, train loss: 0.157943
Test loss: 0.114559
Epoch 36, train loss: 0.328186
Epoch 37, train loss: 0.301406
Epoch 38, train loss: 0.278162
Epoch 39, train loss: 0.257678
Epoch 40, train loss: 0.240154
Epoch 41, train loss: 0.225618
Epoch 42, train loss: 0.213413
Epoch 43, train loss: 0.203567
Epoch 44, train loss: 0.195525
Epoch 45, train loss: 0.189411
Epoch 46, train loss: 0.184777
Epoch 47, train loss: 0.180992
Epoch 48, train loss: 0.178134
Epoch 49, train loss: 0.176014
Epoch 50, train loss: 0.174411
Epoch 51, train loss: 0.173231
Epoch 52, train loss: 0.172263
Epoch 53, train loss: 0.171512
Epoch 54, train loss: 0.170871
Epoch 55, train loss: 0.170312
Epoch 56, train loss: 0.169787
Epoch 57, train loss: 0.169383
Epoch 58, train loss: 0.168888
Epoch 59, train loss: 0.168529
Epoch 60, train loss: 0.168238
Epoch 61, train loss: 0.167907
Epoch 62, train loss: 0.167683
Epoch 63, train loss: 0.167565
Epoch 64, train loss: 0.167250
Epoch 65, train loss: 0.167225
Epoch 66, train loss: 0.167052
Epoch 67, train loss: 0.166960
Epoch 68, train loss: 0.166787
Epoch 69, train loss: 0.166723
Epoch 70, train loss: 0.166667
Epoch 71, train loss: 0.166515
Epoch 72, train loss: 0.166369
Epoch 73, train loss: 0.166145
Epoch 74, train loss: 0.165972
Epoch 75, train loss: 0.165832
Epoch 76, train loss: 0.165602
Epoch 77, train loss: 0.165364
Epoch 78, train loss: 0.164967
Epoch 79, train loss: 0.164617
Epoch 80, train loss: 0.164285
Epoch 81, train loss: 0.163544
Epoch 82, train loss: 0.162928
Epoch 83, train loss: 0.162627
Epoch 84, train loss: 0.162114
Epoch 85, train loss: 0.161599
Epoch 86, train loss: 0.161021
Epoch 87, train loss: 0.160197
Epoch 88, train loss: 0.159564
Epoch 89, train loss: 0.158784
Epoch 90, train loss: 0.158087
Epoch 91, train loss: 0.157196
Epoch 92, train loss: 0.156711
Epoch 93, train loss: 0.155960
Epoch 94, train loss: 0.155173
Epoch 95, train loss: 0.154296
Epoch 96, train loss: 0.153590
Epoch 97, train loss: 0.152526
Epoch 98, train loss: 0.151460
Epoch 99, train loss: 0.150760
Test loss: 0.131007
Epoch 36, train loss: 0.264518
Epoch 37, train loss: 0.244402
Epoch 38, train loss: 0.228160
Epoch 39, train loss: 0.214871
Epoch 40, train loss: 0.204074
Epoch 41, train loss: 0.195820
Epoch 42, train loss: 0.189321
Epoch 43, train loss: 0.184525
Epoch 44, train loss: 0.181061
Epoch 45, train loss: 0.178463
Epoch 46, train loss: 0.176505
Epoch 47, train loss: 0.175136
Epoch 48, train loss: 0.174112
Epoch 49, train loss: 0.173345
Epoch 50, train loss: 0.172789
Epoch 51, train loss: 0.172312
Epoch 52, train loss: 0.171980
Epoch 53, train loss: 0.171679
Epoch 54, train loss: 0.171487
Epoch 55, train loss: 0.171279
Epoch 56, train loss: 0.171355
Epoch 57, train loss: 0.171305
Epoch 58, train loss: 0.171305
Epoch 59, train loss: 0.171376
Epoch 60, train loss: 0.171528
Epoch 61, train loss: 0.171694
Epoch 62, train loss: 0.171954
Epoch 63, train loss: 0.172123
Epoch 64, train loss: 0.172289
Epoch 65, train loss: 0.172667
Epoch 66, train loss: 0.172881
Epoch 67, train loss: 0.173117
Epoch 68, train loss: 0.173178
Epoch 69, train loss: 0.173300
Epoch 70, train loss: 0.173233
Epoch 71, train loss: 0.173187
Epoch 72, train loss: 0.173588
Epoch 73, train loss: 0.173386
Epoch 74, train loss: 0.173653
Epoch 75, train loss: 0.173514
Epoch 76, train loss: 0.173183
Epoch 77, train loss: 0.172858
Epoch 78, train loss: 0.172425
Epoch 79, train loss: 0.172153
Epoch 80, train loss: 0.171532
Epoch 81, train loss: 0.171418
Epoch 82, train loss: 0.171115
Epoch 83, train loss: 0.170624
Epoch 84, train loss: 0.169797
Epoch 85, train loss: 0.169129
Epoch 86, train loss: 0.168477
Epoch 87, train loss: 0.167692
Epoch 88, train loss: 0.166650
Epoch 89, train loss: 0.166021
Epoch 90, train loss: 0.164907
Epoch 91, train loss: 0.163851
Epoch 92, train loss: 0.162738
Epoch 93, train loss: 0.162249
Epoch 94, train loss: 0.161170
Epoch 95, train loss: 0.160661
Epoch 96, train loss: 0.159611
Epoch 97, train loss: 0.158627
Epoch 98, train loss: 0.157897
Epoch 99, train loss: 0.156914
Test loss: 0.347233
Epoch 36, train loss: 0.297631
Epoch 37, train loss: 0.274084
Epoch 38, train loss: 0.253412
Epoch 39, train loss: 0.236039
Epoch 40, train loss: 0.221590
Epoch 41, train loss: 0.209305
Epoch 42, train loss: 0.199572
Epoch 43, train loss: 0.191827
Epoch 44, train loss: 0.185996
Epoch 45, train loss: 0.181858
Epoch 46, train loss: 0.178662
Epoch 47, train loss: 0.176195
Epoch 48, train loss: 0.174603
Epoch 49, train loss: 0.173328
Epoch 50, train loss: 0.172366
Epoch 51, train loss: 0.171586
Epoch 52, train loss: 0.170966
Epoch 53, train loss: 0.170397
Epoch 54, train loss: 0.169957
Epoch 55, train loss: 0.169573
Epoch 56, train loss: 0.169183
Epoch 57, train loss: 0.168984
Epoch 58, train loss: 0.168695
Epoch 59, train loss: 0.168506
Epoch 60, train loss: 0.168420
Epoch 61, train loss: 0.168255
Epoch 62, train loss: 0.168218
Epoch 63, train loss: 0.168324
Epoch 64, train loss: 0.168270
Epoch 65, train loss: 0.168390
Epoch 66, train loss: 0.168281
Epoch 67, train loss: 0.168401
Epoch 68, train loss: 0.168538
Epoch 69, train loss: 0.168423
Epoch 70, train loss: 0.168407
Epoch 71, train loss: 0.168495
Epoch 72, train loss: 0.168612
Epoch 73, train loss: 0.168491
Epoch 74, train loss: 0.168341
Epoch 75, train loss: 0.168301
Epoch 76, train loss: 0.168128
Epoch 77, train loss: 0.168225
Epoch 78, train loss: 0.167832
Epoch 79, train loss: 0.167215
Epoch 80, train loss: 0.167023
Epoch 81, train loss: 0.166554
Epoch 82, train loss: 0.165850
Epoch 83, train loss: 0.165156
Epoch 84, train loss: 0.164668
Epoch 85, train loss: 0.163919
Epoch 86, train loss: 0.163227
Epoch 87, train loss: 0.162396
Epoch 88, train loss: 0.161502
Epoch 89, train loss: 0.160808
Epoch 90, train loss: 0.160068
Epoch 91, train loss: 0.159116
Epoch 92, train loss: 0.158459
Epoch 93, train loss: 0.157555
Epoch 94, train loss: 0.156724
Epoch 95, train loss: 0.156085
Epoch 96, train loss: 0.155470
Epoch 97, train loss: 0.154507
Epoch 98, train loss: 0.153673
Epoch 99, train loss: 0.153201
Test loss: 0.145298
Epoch 36, train loss: 0.295291
Epoch 37, train loss: 0.272537
Epoch 38, train loss: 0.252376
Epoch 39, train loss: 0.235639
Epoch 40, train loss: 0.221551
Epoch 41, train loss: 0.209988
Epoch 42, train loss: 0.200591
Epoch 43, train loss: 0.193307
Epoch 44, train loss: 0.187403
Epoch 45, train loss: 0.182978
Epoch 46, train loss: 0.179572
Epoch 47, train loss: 0.177007
Epoch 48, train loss: 0.175199
Epoch 49, train loss: 0.173773
Epoch 50, train loss: 0.172694
Epoch 51, train loss: 0.171844
Epoch 52, train loss: 0.171178
Epoch 53, train loss: 0.170612
Epoch 54, train loss: 0.170176
Epoch 55, train loss: 0.169759
Epoch 56, train loss: 0.169468
Epoch 57, train loss: 0.169159
Epoch 58, train loss: 0.169006
Epoch 59, train loss: 0.168894
Epoch 60, train loss: 0.168800
Epoch 61, train loss: 0.168689
Epoch 62, train loss: 0.168795
Epoch 63, train loss: 0.168885
Epoch 64, train loss: 0.169030
Epoch 65, train loss: 0.169177
Epoch 66, train loss: 0.169419
Epoch 67, train loss: 0.169749
Epoch 68, train loss: 0.170003
Epoch 69, train loss: 0.170218
Epoch 70, train loss: 0.170595
Epoch 71, train loss: 0.170839
Epoch 72, train loss: 0.171222
Epoch 73, train loss: 0.171471
Epoch 74, train loss: 0.171919
Epoch 75, train loss: 0.172139
Epoch 76, train loss: 0.172370
Epoch 77, train loss: 0.172475
Epoch 78, train loss: 0.172925
Epoch 79, train loss: 0.172709
Epoch 80, train loss: 0.172791
Epoch 81, train loss: 0.172867
Epoch 82, train loss: 0.172778
Epoch 83, train loss: 0.172758
Epoch 84, train loss: 0.172594
Epoch 85, train loss: 0.172573
Epoch 86, train loss: 0.172437
Epoch 87, train loss: 0.171903
Epoch 88, train loss: 0.171673
Epoch 89, train loss: 0.170977
Epoch 90, train loss: 0.171063
Epoch 91, train loss: 0.170259
Epoch 92, train loss: 0.169686
Epoch 93, train loss: 0.169038
Epoch 94, train loss: 0.168644
Epoch 95, train loss: 0.167776
Epoch 96, train loss: 0.166792
Epoch 97, train loss: 0.166048
Epoch 98, train loss: 0.164995
Epoch 99, train loss: 0.164274
Test loss: 0.158343
Epoch 36, train loss: 0.268185
Epoch 37, train loss: 0.247494
Epoch 38, train loss: 0.230104
Epoch 39, train loss: 0.216137
Epoch 40, train loss: 0.204743
Epoch 41, train loss: 0.195886
Epoch 42, train loss: 0.188823
Epoch 43, train loss: 0.183834
Epoch 44, train loss: 0.180276
Epoch 45, train loss: 0.177380
Epoch 46, train loss: 0.175357
Epoch 47, train loss: 0.173887
Epoch 48, train loss: 0.172744
Epoch 49, train loss: 0.171835
Epoch 50, train loss: 0.171118
Epoch 51, train loss: 0.170500
Epoch 52, train loss: 0.169979
Epoch 53, train loss: 0.169587
Epoch 54, train loss: 0.169200
Epoch 55, train loss: 0.168867
Epoch 56, train loss: 0.168673
Epoch 57, train loss: 0.168526
Epoch 58, train loss: 0.168378
Epoch 59, train loss: 0.168351
Epoch 60, train loss: 0.168177
Epoch 61, train loss: 0.168325
Epoch 62, train loss: 0.168393
Epoch 63, train loss: 0.168606
Epoch 64, train loss: 0.168769
Epoch 65, train loss: 0.168800
Epoch 66, train loss: 0.169045
Epoch 67, train loss: 0.169124
Epoch 68, train loss: 0.169282
Epoch 69, train loss: 0.169561
Epoch 70, train loss: 0.169605
Epoch 71, train loss: 0.169773
Epoch 72, train loss: 0.169850
Epoch 73, train loss: 0.169599
Epoch 74, train loss: 0.169453
Epoch 75, train loss: 0.168997
Epoch 76, train loss: 0.168568
Epoch 77, train loss: 0.168155
Epoch 78, train loss: 0.167901
Epoch 79, train loss: 0.167359
Epoch 80, train loss: 0.166468
Epoch 81, train loss: 0.165685
Epoch 82, train loss: 0.165189
Epoch 83, train loss: 0.164526
Epoch 84, train loss: 0.163601
Epoch 85, train loss: 0.162975
Epoch 86, train loss: 0.162107
Epoch 87, train loss: 0.161255
Epoch 88, train loss: 0.160348
Epoch 89, train loss: 0.159450
Epoch 90, train loss: 0.158321
Epoch 91, train loss: 0.157453
Epoch 92, train loss: 0.156421
Epoch 93, train loss: 0.155396
Epoch 94, train loss: 0.154580
Epoch 95, train loss: 0.153468
Epoch 96, train loss: 0.152383
Epoch 97, train loss: 0.151882
Epoch 98, train loss: 0.150933
Epoch 99, train loss: 0.150067
Test loss: 0.146837
Epoch 36, train loss: 0.321603
Epoch 37, train loss: 0.295869
Epoch 38, train loss: 0.273308
Epoch 39, train loss: 0.253537
Epoch 40, train loss: 0.236891
Epoch 41, train loss: 0.222091
Epoch 42, train loss: 0.210101
Epoch 43, train loss: 0.200973
Epoch 44, train loss: 0.193408
Epoch 45, train loss: 0.187541
Epoch 46, train loss: 0.183111
Epoch 47, train loss: 0.179751
Epoch 48, train loss: 0.177120
Epoch 49, train loss: 0.175342
Epoch 50, train loss: 0.173957
Epoch 51, train loss: 0.172872
Epoch 52, train loss: 0.172014
Epoch 53, train loss: 0.171323
Epoch 54, train loss: 0.170822
Epoch 55, train loss: 0.170300
Epoch 56, train loss: 0.169813
Epoch 57, train loss: 0.169442
Epoch 58, train loss: 0.169170
Epoch 59, train loss: 0.168844
Epoch 60, train loss: 0.168593
Epoch 61, train loss: 0.168437
Epoch 62, train loss: 0.168394
Epoch 63, train loss: 0.168289
Epoch 64, train loss: 0.168193
Epoch 65, train loss: 0.168198
Epoch 66, train loss: 0.168359
Epoch 67, train loss: 0.168341
Epoch 68, train loss: 0.168281
Epoch 69, train loss: 0.168303
Epoch 70, train loss: 0.168401
Epoch 71, train loss: 0.168560
Epoch 72, train loss: 0.168560
Epoch 73, train loss: 0.168578
Epoch 74, train loss: 0.168721
Epoch 75, train loss: 0.168691
Epoch 76, train loss: 0.168552
Epoch 77, train loss: 0.168474
Epoch 78, train loss: 0.168326
Epoch 79, train loss: 0.168087
Epoch 80, train loss: 0.167631
Epoch 81, train loss: 0.167369
Epoch 82, train loss: 0.167104
Epoch 83, train loss: 0.166527
Epoch 84, train loss: 0.166066
Epoch 85, train loss: 0.165410
Epoch 86, train loss: 0.164820
Epoch 87, train loss: 0.164268
Epoch 88, train loss: 0.163644
Epoch 89, train loss: 0.162862
Epoch 90, train loss: 0.162285
Epoch 91, train loss: 0.161747
Epoch 92, train loss: 0.160964
Epoch 93, train loss: 0.160207
Epoch 94, train loss: 0.159433
Epoch 95, train loss: 0.158909
Epoch 96, train loss: 0.158068
Epoch 97, train loss: 0.157244
Epoch 98, train loss: 0.156371
Epoch 99, train loss: 0.155732
Test loss: 0.150861
Epoch 36, train loss: 0.304568
Epoch 37, train loss: 0.279785
Epoch 38, train loss: 0.258244
Epoch 39, train loss: 0.240312
Epoch 40, train loss: 0.225259
Epoch 41, train loss: 0.212606
Epoch 42, train loss: 0.202447
Epoch 43, train loss: 0.194108
Epoch 44, train loss: 0.187815
Epoch 45, train loss: 0.183062
Epoch 46, train loss: 0.179507
Epoch 47, train loss: 0.176703
Epoch 48, train loss: 0.174690
Epoch 49, train loss: 0.173149
Epoch 50, train loss: 0.172002
Epoch 51, train loss: 0.171071
Epoch 52, train loss: 0.170339
Epoch 53, train loss: 0.169766
Epoch 54, train loss: 0.169326
Epoch 55, train loss: 0.168876
Epoch 56, train loss: 0.168541
Epoch 57, train loss: 0.168296
Epoch 58, train loss: 0.168078
Epoch 59, train loss: 0.167948
Epoch 60, train loss: 0.167720
Epoch 61, train loss: 0.167740
Epoch 62, train loss: 0.167887
Epoch 63, train loss: 0.167901
Epoch 64, train loss: 0.168036
Epoch 65, train loss: 0.168151
Epoch 66, train loss: 0.168370
Epoch 67, train loss: 0.168506
Epoch 68, train loss: 0.168825
Epoch 69, train loss: 0.169021
Epoch 70, train loss: 0.169366
Epoch 71, train loss: 0.169720
Epoch 72, train loss: 0.169913
Epoch 73, train loss: 0.170335
Epoch 74, train loss: 0.170634
Epoch 75, train loss: 0.170465
Epoch 76, train loss: 0.170491
Epoch 77, train loss: 0.170735
Epoch 78, train loss: 0.170959
Epoch 79, train loss: 0.170877
Epoch 80, train loss: 0.170902
Epoch 81, train loss: 0.170888
Epoch 82, train loss: 0.170706
Epoch 83, train loss: 0.170586
Epoch 84, train loss: 0.170047
Epoch 85, train loss: 0.169607
Epoch 86, train loss: 0.169201
Epoch 87, train loss: 0.168326
Epoch 88, train loss: 0.167571
Epoch 89, train loss: 0.166811
Epoch 90, train loss: 0.165769
Epoch 91, train loss: 0.165213
Epoch 92, train loss: 0.164261
Epoch 93, train loss: 0.162901
Epoch 94, train loss: 0.162126
Epoch 95, train loss: 0.161752
Epoch 96, train loss: 0.160843
Epoch 97, train loss: 0.159817
Epoch 98, train loss: 0.158563
Epoch 99, train loss: 0.157766
Test loss: 0.179612
Epoch 36, train loss: 0.262185
Epoch 37, train loss: 0.242596
Epoch 38, train loss: 0.226045
Epoch 39, train loss: 0.212642
Epoch 40, train loss: 0.202021
Epoch 41, train loss: 0.193732
Epoch 42, train loss: 0.187293
Epoch 43, train loss: 0.182458
Epoch 44, train loss: 0.178764
Epoch 45, train loss: 0.176009
Epoch 46, train loss: 0.173975
Epoch 47, train loss: 0.172524
Epoch 48, train loss: 0.171422
Epoch 49, train loss: 0.170589
Epoch 50, train loss: 0.169917
Epoch 51, train loss: 0.169372
Epoch 52, train loss: 0.168977
Epoch 53, train loss: 0.168633
Epoch 54, train loss: 0.168392
Epoch 55, train loss: 0.168217
Epoch 56, train loss: 0.168021
Epoch 57, train loss: 0.167977
Epoch 58, train loss: 0.167972
Epoch 59, train loss: 0.168102
Epoch 60, train loss: 0.168054
Epoch 61, train loss: 0.168227
Epoch 62, train loss: 0.168446
Epoch 63, train loss: 0.168779
Epoch 64, train loss: 0.169100
Epoch 65, train loss: 0.169333
Epoch 66, train loss: 0.169642
Epoch 67, train loss: 0.169823
Epoch 68, train loss: 0.170044
Epoch 69, train loss: 0.170486
Epoch 70, train loss: 0.170647
Epoch 71, train loss: 0.171058
Epoch 72, train loss: 0.171130
Epoch 73, train loss: 0.171496
Epoch 74, train loss: 0.171648
Epoch 75, train loss: 0.171567
Epoch 76, train loss: 0.171806
Epoch 77, train loss: 0.171910
Epoch 78, train loss: 0.171544
Epoch 79, train loss: 0.171510
Epoch 80, train loss: 0.171359
Epoch 81, train loss: 0.171077
Epoch 82, train loss: 0.170264
Epoch 83, train loss: 0.170149
Epoch 84, train loss: 0.169664
Epoch 85, train loss: 0.169176
Epoch 86, train loss: 0.168764
Epoch 87, train loss: 0.168262
Epoch 88, train loss: 0.167744
Epoch 89, train loss: 0.167157
Epoch 90, train loss: 0.166320
Epoch 91, train loss: 0.165940
Epoch 92, train loss: 0.165447
Epoch 93, train loss: 0.164849
Epoch 94, train loss: 0.163819
Epoch 95, train loss: 0.163419
Epoch 96, train loss: 0.162655
Epoch 97, train loss: 0.161715
Epoch 98, train loss: 0.160820
Epoch 99, train loss: 0.160112
Test loss: 0.184402
Epoch 36, train loss: 0.412465
Epoch 37, train loss: 0.379360
Epoch 38, train loss: 0.349044
Epoch 39, train loss: 0.321256
Epoch 40, train loss: 0.296327
Epoch 41, train loss: 0.274182
Epoch 42, train loss: 0.254746
Epoch 43, train loss: 0.237590
Epoch 44, train loss: 0.223309
Epoch 45, train loss: 0.211346
Epoch 46, train loss: 0.201927
Epoch 47, train loss: 0.194037
Epoch 48, train loss: 0.188032
Epoch 49, train loss: 0.183396
Epoch 50, train loss: 0.179673
Epoch 51, train loss: 0.176947
Epoch 52, train loss: 0.174796
Epoch 53, train loss: 0.173111
Epoch 54, train loss: 0.171877
Epoch 55, train loss: 0.170910
Epoch 56, train loss: 0.170143
Epoch 57, train loss: 0.169445
Epoch 58, train loss: 0.168898
Epoch 59, train loss: 0.168373
Epoch 60, train loss: 0.167978
Epoch 61, train loss: 0.167512
Epoch 62, train loss: 0.167175
Epoch 63, train loss: 0.166866
Epoch 64, train loss: 0.166604
Epoch 65, train loss: 0.166282
Epoch 66, train loss: 0.166174
Epoch 67, train loss: 0.166082
Epoch 68, train loss: 0.165918
Epoch 69, train loss: 0.165854
Epoch 70, train loss: 0.165818
Epoch 71, train loss: 0.165888
Epoch 72, train loss: 0.165853
Epoch 73, train loss: 0.165788
Epoch 74, train loss: 0.165643
Epoch 75, train loss: 0.165711
Epoch 76, train loss: 0.165752
Epoch 77, train loss: 0.165841
Epoch 78, train loss: 0.165818
Epoch 79, train loss: 0.165768
Epoch 80, train loss: 0.165474
Epoch 81, train loss: 0.165355
Epoch 82, train loss: 0.165320
Epoch 83, train loss: 0.164964
Epoch 84, train loss: 0.164730
Epoch 85, train loss: 0.164416
Epoch 86, train loss: 0.164134
Epoch 87, train loss: 0.163732
Epoch 88, train loss: 0.163624
Epoch 89, train loss: 0.162944
Epoch 90, train loss: 0.162371
Epoch 91, train loss: 0.162032
Epoch 92, train loss: 0.161581
Epoch 93, train loss: 0.160979
Epoch 94, train loss: 0.160153
Epoch 95, train loss: 0.159610
Epoch 96, train loss: 0.159117
Epoch 97, train loss: 0.158332
Epoch 98, train loss: 0.157852
Epoch 99, train loss: 0.156948
Test loss: 0.262433
Epoch 36, train loss: 0.335994
Epoch 37, train loss: 0.309438
Epoch 38, train loss: 0.285517
Epoch 39, train loss: 0.264733
Epoch 40, train loss: 0.246678
Epoch 41, train loss: 0.231638
Epoch 42, train loss: 0.218908
Epoch 43, train loss: 0.208664
Epoch 44, train loss: 0.200260
Epoch 45, train loss: 0.193356
Epoch 46, train loss: 0.188143
Epoch 47, train loss: 0.184036
Epoch 48, train loss: 0.180875
Epoch 49, train loss: 0.178477
Epoch 50, train loss: 0.176513
Epoch 51, train loss: 0.174933
Epoch 52, train loss: 0.173719
Epoch 53, train loss: 0.172736
Epoch 54, train loss: 0.172002
Epoch 55, train loss: 0.171321
Epoch 56, train loss: 0.170768
Epoch 57, train loss: 0.170170
Epoch 58, train loss: 0.169663
Epoch 59, train loss: 0.169103
Epoch 60, train loss: 0.168554
Epoch 61, train loss: 0.168144
Epoch 62, train loss: 0.167684
Epoch 63, train loss: 0.167278
Epoch 64, train loss: 0.166955
Epoch 65, train loss: 0.166625
Epoch 66, train loss: 0.166266
Epoch 67, train loss: 0.166057
Epoch 68, train loss: 0.165689
Epoch 69, train loss: 0.165260
Epoch 70, train loss: 0.165076
Epoch 71, train loss: 0.164732
Epoch 72, train loss: 0.164551
Epoch 73, train loss: 0.164283
Epoch 74, train loss: 0.163997
Epoch 75, train loss: 0.163873
Epoch 76, train loss: 0.163427
Epoch 77, train loss: 0.163153
Epoch 78, train loss: 0.162819
Epoch 79, train loss: 0.162493
Epoch 80, train loss: 0.162233
Epoch 81, train loss: 0.161902
Epoch 82, train loss: 0.161526
Epoch 83, train loss: 0.161251
Epoch 84, train loss: 0.160759
Epoch 85, train loss: 0.160450
Epoch 86, train loss: 0.160141
Epoch 87, train loss: 0.159583
Epoch 88, train loss: 0.159171
Epoch 89, train loss: 0.158611
Epoch 90, train loss: 0.158155
Epoch 91, train loss: 0.157739
Epoch 92, train loss: 0.157016
Epoch 93, train loss: 0.156479
Epoch 94, train loss: 0.155788
Epoch 95, train loss: 0.155180
Epoch 96, train loss: 0.154319
Epoch 97, train loss: 0.153713
Epoch 98, train loss: 0.153074
Epoch 99, train loss: 0.152162
Test loss: 0.364516
Epoch 36, train loss: 0.270129
Epoch 37, train loss: 0.250009
Epoch 38, train loss: 0.232989
Epoch 39, train loss: 0.218740
Epoch 40, train loss: 0.207362
Epoch 41, train loss: 0.198334
Epoch 42, train loss: 0.191166
Epoch 43, train loss: 0.186034
Epoch 44, train loss: 0.181902
Epoch 45, train loss: 0.178724
Epoch 46, train loss: 0.176488
Epoch 47, train loss: 0.174816
Epoch 48, train loss: 0.173533
Epoch 49, train loss: 0.172537
Epoch 50, train loss: 0.171731
Epoch 51, train loss: 0.171074
Epoch 52, train loss: 0.170487
Epoch 53, train loss: 0.169993
Epoch 54, train loss: 0.169603
Epoch 55, train loss: 0.169274
Epoch 56, train loss: 0.169013
Epoch 57, train loss: 0.168637
Epoch 58, train loss: 0.168439
Epoch 59, train loss: 0.168424
Epoch 60, train loss: 0.168196
Epoch 61, train loss: 0.168070
Epoch 62, train loss: 0.168133
Epoch 63, train loss: 0.168197
Epoch 64, train loss: 0.168386
Epoch 65, train loss: 0.168581
Epoch 66, train loss: 0.168569
Epoch 67, train loss: 0.168681
Epoch 68, train loss: 0.168737
Epoch 69, train loss: 0.168964
Epoch 70, train loss: 0.169103
Epoch 71, train loss: 0.169158
Epoch 72, train loss: 0.169177
Epoch 73, train loss: 0.169190
Epoch 74, train loss: 0.169236
Epoch 75, train loss: 0.169208
Epoch 76, train loss: 0.169155
Epoch 77, train loss: 0.169313
Epoch 78, train loss: 0.168961
Epoch 79, train loss: 0.168605
Epoch 80, train loss: 0.168363
Epoch 81, train loss: 0.168043
Epoch 82, train loss: 0.167581
Epoch 83, train loss: 0.167223
Epoch 84, train loss: 0.167002
Epoch 85, train loss: 0.166790
Epoch 86, train loss: 0.166226
Epoch 87, train loss: 0.165829
Epoch 88, train loss: 0.165201
Epoch 89, train loss: 0.164229
Epoch 90, train loss: 0.163663
Epoch 91, train loss: 0.162991
Epoch 92, train loss: 0.162482
Epoch 93, train loss: 0.161284
Epoch 94, train loss: 0.160611
Epoch 95, train loss: 0.159657
Epoch 96, train loss: 0.159087
Epoch 97, train loss: 0.157925
Epoch 98, train loss: 0.157394
Epoch 99, train loss: 0.156858
Test loss: 0.146091
Epoch 36, train loss: 0.314663
Epoch 37, train loss: 0.289505
Epoch 38, train loss: 0.267434
Epoch 39, train loss: 0.248358
Epoch 40, train loss: 0.232450
Epoch 41, train loss: 0.219009
Epoch 42, train loss: 0.207874
Epoch 43, train loss: 0.199176
Epoch 44, train loss: 0.192346
Epoch 45, train loss: 0.187027
Epoch 46, train loss: 0.183099
Epoch 47, train loss: 0.180042
Epoch 48, train loss: 0.177669
Epoch 49, train loss: 0.175914
Epoch 50, train loss: 0.174589
Epoch 51, train loss: 0.173568
Epoch 52, train loss: 0.172711
Epoch 53, train loss: 0.172022
Epoch 54, train loss: 0.171423
Epoch 55, train loss: 0.170962
Epoch 56, train loss: 0.170489
Epoch 57, train loss: 0.170116
Epoch 58, train loss: 0.169774
Epoch 59, train loss: 0.169520
Epoch 60, train loss: 0.169274
Epoch 61, train loss: 0.169155
Epoch 62, train loss: 0.169086
Epoch 63, train loss: 0.169020
Epoch 64, train loss: 0.169066
Epoch 65, train loss: 0.169103
Epoch 66, train loss: 0.169154
Epoch 67, train loss: 0.169240
Epoch 68, train loss: 0.169413
Epoch 69, train loss: 0.169482
Epoch 70, train loss: 0.169508
Epoch 71, train loss: 0.169835
Epoch 72, train loss: 0.169997
Epoch 73, train loss: 0.170166
Epoch 74, train loss: 0.170207
Epoch 75, train loss: 0.170338
Epoch 76, train loss: 0.170251
Epoch 77, train loss: 0.170406
Epoch 78, train loss: 0.170376
Epoch 79, train loss: 0.170353
Epoch 80, train loss: 0.170457
Epoch 81, train loss: 0.170491
Epoch 82, train loss: 0.170357
Epoch 83, train loss: 0.170105
Epoch 84, train loss: 0.169544
Epoch 85, train loss: 0.169083
Epoch 86, train loss: 0.168705
Epoch 87, train loss: 0.168106
Epoch 88, train loss: 0.167535
Epoch 89, train loss: 0.166808
Epoch 90, train loss: 0.165955
Epoch 91, train loss: 0.165154
Epoch 92, train loss: 0.164391
Epoch 93, train loss: 0.163481
Epoch 94, train loss: 0.162949
Epoch 95, train loss: 0.162139
Epoch 96, train loss: 0.161290
Epoch 97, train loss: 0.160505
Epoch 98, train loss: 0.159700
Epoch 99, train loss: 0.158905
Test loss: 0.134538
Epoch 36, train loss: 0.297376
Epoch 37, train loss: 0.273279
Epoch 38, train loss: 0.252585
Epoch 39, train loss: 0.235312
Epoch 40, train loss: 0.220835
Epoch 41, train loss: 0.208829
Epoch 42, train loss: 0.199727
Epoch 43, train loss: 0.192390
Epoch 44, train loss: 0.186858
Epoch 45, train loss: 0.182536
Epoch 46, train loss: 0.179436
Epoch 47, train loss: 0.177076
Epoch 48, train loss: 0.175329
Epoch 49, train loss: 0.173965
Epoch 50, train loss: 0.172900
Epoch 51, train loss: 0.172062
Epoch 52, train loss: 0.171385
Epoch 53, train loss: 0.170794
Epoch 54, train loss: 0.170242
Epoch 55, train loss: 0.169826
Epoch 56, train loss: 0.169473
Epoch 57, train loss: 0.169089
Epoch 58, train loss: 0.168838
Epoch 59, train loss: 0.168721
Epoch 60, train loss: 0.168500
Epoch 61, train loss: 0.168348
Epoch 62, train loss: 0.168312
Epoch 63, train loss: 0.168373
Epoch 64, train loss: 0.168318
Epoch 65, train loss: 0.168412
Epoch 66, train loss: 0.168489
Epoch 67, train loss: 0.168739
Epoch 68, train loss: 0.168917
Epoch 69, train loss: 0.168925
Epoch 70, train loss: 0.169117
Epoch 71, train loss: 0.169431
Epoch 72, train loss: 0.169342
Epoch 73, train loss: 0.169348
Epoch 74, train loss: 0.169315
Epoch 75, train loss: 0.169073
Epoch 76, train loss: 0.168713
Epoch 77, train loss: 0.168039
Epoch 78, train loss: 0.167660
Epoch 79, train loss: 0.166886
Epoch 80, train loss: 0.166169
Epoch 81, train loss: 0.165386
Epoch 82, train loss: 0.164360
Epoch 83, train loss: 0.163435
Epoch 84, train loss: 0.162422
Epoch 85, train loss: 0.161461
Epoch 86, train loss: 0.160375
Epoch 87, train loss: 0.158998
Epoch 88, train loss: 0.157941
Epoch 89, train loss: 0.156882
Epoch 90, train loss: 0.155673
Epoch 91, train loss: 0.154423
Epoch 92, train loss: 0.153462
Epoch 93, train loss: 0.152311
Epoch 94, train loss: 0.150946
Epoch 95, train loss: 0.149939
Epoch 96, train loss: 0.149435
Epoch 97, train loss: 0.148265
Epoch 98, train loss: 0.147187
Epoch 99, train loss: 0.146194
Test loss: 0.115850
Epoch 36, train loss: 0.294857
Epoch 37, train loss: 0.272266
Epoch 38, train loss: 0.252796
Epoch 39, train loss: 0.236569
Epoch 40, train loss: 0.222299
Epoch 41, train loss: 0.210857
Epoch 42, train loss: 0.201174
Epoch 43, train loss: 0.193699
Epoch 44, train loss: 0.187738
Epoch 45, train loss: 0.183119
Epoch 46, train loss: 0.179572
Epoch 47, train loss: 0.176900
Epoch 48, train loss: 0.174825
Epoch 49, train loss: 0.173249
Epoch 50, train loss: 0.172032
Epoch 51, train loss: 0.171013
Epoch 52, train loss: 0.170237
Epoch 53, train loss: 0.169544
Epoch 54, train loss: 0.169053
Epoch 55, train loss: 0.168557
Epoch 56, train loss: 0.168147
Epoch 57, train loss: 0.167832
Epoch 58, train loss: 0.167578
Epoch 59, train loss: 0.167422
Epoch 60, train loss: 0.167338
Epoch 61, train loss: 0.167214
Epoch 62, train loss: 0.167100
Epoch 63, train loss: 0.167115
Epoch 64, train loss: 0.167303
Epoch 65, train loss: 0.167476
Epoch 66, train loss: 0.167607
Epoch 67, train loss: 0.167900
Epoch 68, train loss: 0.168192
Epoch 69, train loss: 0.168459
Epoch 70, train loss: 0.168886
Epoch 71, train loss: 0.169129
Epoch 72, train loss: 0.169339
Epoch 73, train loss: 0.169625
Epoch 74, train loss: 0.169909
Epoch 75, train loss: 0.170014
Epoch 76, train loss: 0.169986
Epoch 77, train loss: 0.169853
Epoch 78, train loss: 0.170084
Epoch 79, train loss: 0.169787
Epoch 80, train loss: 0.169509
Epoch 81, train loss: 0.169302
Epoch 82, train loss: 0.168994
Epoch 83, train loss: 0.168523
Epoch 84, train loss: 0.168171
Epoch 85, train loss: 0.167454
Epoch 86, train loss: 0.166843
Epoch 87, train loss: 0.166210
Epoch 88, train loss: 0.165858
Epoch 89, train loss: 0.165181
Epoch 90, train loss: 0.164500
Epoch 91, train loss: 0.163815
Epoch 92, train loss: 0.163141
Epoch 93, train loss: 0.162489
Epoch 94, train loss: 0.161271
Epoch 95, train loss: 0.160498
Epoch 96, train loss: 0.160117
Epoch 97, train loss: 0.159073
Epoch 98, train loss: 0.158598
Epoch 99, train loss: 0.157633
Test loss: 0.281645
Epoch 36, train loss: 0.282741
Epoch 37, train loss: 0.259949
Epoch 38, train loss: 0.240588
Epoch 39, train loss: 0.224464
Epoch 40, train loss: 0.211125
Epoch 41, train loss: 0.200730
Epoch 42, train loss: 0.192618
Epoch 43, train loss: 0.186531
Epoch 44, train loss: 0.181793
Epoch 45, train loss: 0.178384
Epoch 46, train loss: 0.175794
Epoch 47, train loss: 0.174029
Epoch 48, train loss: 0.172698
Epoch 49, train loss: 0.171672
Epoch 50, train loss: 0.170858
Epoch 51, train loss: 0.170179
Epoch 52, train loss: 0.169587
Epoch 53, train loss: 0.169032
Epoch 54, train loss: 0.168562
Epoch 55, train loss: 0.168193
Epoch 56, train loss: 0.167876
Epoch 57, train loss: 0.167502
Epoch 58, train loss: 0.167350
Epoch 59, train loss: 0.167138
Epoch 60, train loss: 0.167135
Epoch 61, train loss: 0.167213
Epoch 62, train loss: 0.167209
Epoch 63, train loss: 0.167184
Epoch 64, train loss: 0.167357
Epoch 65, train loss: 0.167620
Epoch 66, train loss: 0.167723
Epoch 67, train loss: 0.167965
Epoch 68, train loss: 0.168142
Epoch 69, train loss: 0.168322
Epoch 70, train loss: 0.168616
Epoch 71, train loss: 0.168686
Epoch 72, train loss: 0.168749
Epoch 73, train loss: 0.168697
Epoch 74, train loss: 0.168731
Epoch 75, train loss: 0.168618
Epoch 76, train loss: 0.168583
Epoch 77, train loss: 0.168313
Epoch 78, train loss: 0.167829
Epoch 79, train loss: 0.167570
Epoch 80, train loss: 0.166790
Epoch 81, train loss: 0.166491
Epoch 82, train loss: 0.166163
Epoch 83, train loss: 0.165344
Epoch 84, train loss: 0.164914
Epoch 85, train loss: 0.164052
Epoch 86, train loss: 0.163478
Epoch 87, train loss: 0.162693
Epoch 88, train loss: 0.162009
Epoch 89, train loss: 0.160932
Epoch 90, train loss: 0.160351
Epoch 91, train loss: 0.159671
Epoch 92, train loss: 0.158900
Epoch 93, train loss: 0.157936
Epoch 94, train loss: 0.157117
Epoch 95, train loss: 0.156028
Epoch 96, train loss: 0.155220
Epoch 97, train loss: 0.154294
Epoch 98, train loss: 0.153376
Epoch 99, train loss: 0.152448
Test loss: 0.219821
Epoch 36, train loss: 0.312938
Epoch 37, train loss: 0.288170
Epoch 38, train loss: 0.265787
Epoch 39, train loss: 0.247002
Epoch 40, train loss: 0.230820
Epoch 41, train loss: 0.217522
Epoch 42, train loss: 0.206805
Epoch 43, train loss: 0.197902
Epoch 44, train loss: 0.190949
Epoch 45, train loss: 0.185607
Epoch 46, train loss: 0.181316
Epoch 47, train loss: 0.178086
Epoch 48, train loss: 0.175684
Epoch 49, train loss: 0.173829
Epoch 50, train loss: 0.172406
Epoch 51, train loss: 0.171278
Epoch 52, train loss: 0.170395
Epoch 53, train loss: 0.169747
Epoch 54, train loss: 0.169089
Epoch 55, train loss: 0.168546
Epoch 56, train loss: 0.168083
Epoch 57, train loss: 0.167666
Epoch 58, train loss: 0.167391
Epoch 59, train loss: 0.166939
Epoch 60, train loss: 0.166671
Epoch 61, train loss: 0.166458
Epoch 62, train loss: 0.166237
Epoch 63, train loss: 0.166084
Epoch 64, train loss: 0.165934
Epoch 65, train loss: 0.165891
Epoch 66, train loss: 0.165877
Epoch 67, train loss: 0.165889
Epoch 68, train loss: 0.165904
Epoch 69, train loss: 0.165930
Epoch 70, train loss: 0.165849
Epoch 71, train loss: 0.165807
Epoch 72, train loss: 0.165813
Epoch 73, train loss: 0.165946
Epoch 74, train loss: 0.165934
Epoch 75, train loss: 0.165677
Epoch 76, train loss: 0.165582
Epoch 77, train loss: 0.165363
Epoch 78, train loss: 0.165092
Epoch 79, train loss: 0.164975
Epoch 80, train loss: 0.164691
Epoch 81, train loss: 0.164286
Epoch 82, train loss: 0.163899
Epoch 83, train loss: 0.163682
Epoch 84, train loss: 0.163198
Epoch 85, train loss: 0.162447
Epoch 86, train loss: 0.161868
Epoch 87, train loss: 0.161317
Epoch 88, train loss: 0.160547
Epoch 89, train loss: 0.159957
Epoch 90, train loss: 0.159211
Epoch 91, train loss: 0.158358
Epoch 92, train loss: 0.157571
Epoch 93, train loss: 0.156828
Epoch 94, train loss: 0.156299
Epoch 95, train loss: 0.155489
Epoch 96, train loss: 0.154803
Epoch 97, train loss: 0.154129
Epoch 98, train loss: 0.153565
Epoch 99, train loss: 0.152883
Test loss: 0.187975
Epoch 36, train loss: 0.317261
Epoch 37, train loss: 0.290930
Epoch 38, train loss: 0.268050
Epoch 39, train loss: 0.248481
Epoch 40, train loss: 0.231755
Epoch 41, train loss: 0.218257
Epoch 42, train loss: 0.207290
Epoch 43, train loss: 0.198432
Epoch 44, train loss: 0.191617
Epoch 45, train loss: 0.186370
Epoch 46, train loss: 0.182195
Epoch 47, train loss: 0.179261
Epoch 48, train loss: 0.177021
Epoch 49, train loss: 0.175310
Epoch 50, train loss: 0.174019
Epoch 51, train loss: 0.173017
Epoch 52, train loss: 0.172154
Epoch 53, train loss: 0.171428
Epoch 54, train loss: 0.170858
Epoch 55, train loss: 0.170312
Epoch 56, train loss: 0.169828
Epoch 57, train loss: 0.169373
Epoch 58, train loss: 0.168931
Epoch 59, train loss: 0.168648
Epoch 60, train loss: 0.168376
Epoch 61, train loss: 0.168103
Epoch 62, train loss: 0.167905
Epoch 63, train loss: 0.167875
Epoch 64, train loss: 0.167693
Epoch 65, train loss: 0.167603
Epoch 66, train loss: 0.167509
Epoch 67, train loss: 0.167584
Epoch 68, train loss: 0.167554
Epoch 69, train loss: 0.167589
Epoch 70, train loss: 0.167623
Epoch 71, train loss: 0.167682
Epoch 72, train loss: 0.167630
Epoch 73, train loss: 0.167726
Epoch 74, train loss: 0.167707
Epoch 75, train loss: 0.167637
Epoch 76, train loss: 0.167377
Epoch 77, train loss: 0.167380
Epoch 78, train loss: 0.167274
Epoch 79, train loss: 0.167075
Epoch 80, train loss: 0.166857
Epoch 81, train loss: 0.166835
Epoch 82, train loss: 0.166346
Epoch 83, train loss: 0.166065
Epoch 84, train loss: 0.165612
Epoch 85, train loss: 0.165261
Epoch 86, train loss: 0.164461
Epoch 87, train loss: 0.164096
Epoch 88, train loss: 0.163793
Epoch 89, train loss: 0.163157
Epoch 90, train loss: 0.162735
Epoch 91, train loss: 0.162515
Epoch 92, train loss: 0.161692
Epoch 93, train loss: 0.160981
Epoch 94, train loss: 0.160120
Epoch 95, train loss: 0.159354
Epoch 96, train loss: 0.158328
Epoch 97, train loss: 0.157450
Epoch 98, train loss: 0.156756
Epoch 99, train loss: 0.156029
Test loss: 0.151353
Epoch 36, train loss: 0.309546
Epoch 37, train loss: 0.285242
Epoch 38, train loss: 0.264410
Epoch 39, train loss: 0.246656
Epoch 40, train loss: 0.231050
Epoch 41, train loss: 0.218219
Epoch 42, train loss: 0.207800
Epoch 43, train loss: 0.199160
Epoch 44, train loss: 0.192397
Epoch 45, train loss: 0.187079
Epoch 46, train loss: 0.183032
Epoch 47, train loss: 0.179964
Epoch 48, train loss: 0.177517
Epoch 49, train loss: 0.175722
Epoch 50, train loss: 0.174334
Epoch 51, train loss: 0.173251
Epoch 52, train loss: 0.172383
Epoch 53, train loss: 0.171729
Epoch 54, train loss: 0.171167
Epoch 55, train loss: 0.170744
Epoch 56, train loss: 0.170274
Epoch 57, train loss: 0.169972
Epoch 58, train loss: 0.169624
Epoch 59, train loss: 0.169444
Epoch 60, train loss: 0.169292
Epoch 61, train loss: 0.169065
Epoch 62, train loss: 0.168976
Epoch 63, train loss: 0.168932
Epoch 64, train loss: 0.168992
Epoch 65, train loss: 0.169022
Epoch 66, train loss: 0.169093
Epoch 67, train loss: 0.169257
Epoch 68, train loss: 0.169346
Epoch 69, train loss: 0.169568
Epoch 70, train loss: 0.169574
Epoch 71, train loss: 0.169612
Epoch 72, train loss: 0.169952
Epoch 73, train loss: 0.170186
Epoch 74, train loss: 0.170286
Epoch 75, train loss: 0.170688
Epoch 76, train loss: 0.170752
Epoch 77, train loss: 0.170669
Epoch 78, train loss: 0.170567
Epoch 79, train loss: 0.170370
Epoch 80, train loss: 0.170612
Epoch 81, train loss: 0.170176
Epoch 82, train loss: 0.170138
Epoch 83, train loss: 0.169676
Epoch 84, train loss: 0.169269
Epoch 85, train loss: 0.168899
Epoch 86, train loss: 0.168366
Epoch 87, train loss: 0.167793
Epoch 88, train loss: 0.167489
Epoch 89, train loss: 0.166580
Epoch 90, train loss: 0.165468
Epoch 91, train loss: 0.164551
Epoch 92, train loss: 0.163725
Epoch 93, train loss: 0.162911
Epoch 94, train loss: 0.161777
Epoch 95, train loss: 0.161128
Epoch 96, train loss: 0.160098
Epoch 97, train loss: 0.159099
Epoch 98, train loss: 0.158367
Epoch 99, train loss: 0.157411
Test loss: 0.157693
Epoch 36, train loss: 0.315639
Epoch 37, train loss: 0.289757
Epoch 38, train loss: 0.267210
Epoch 39, train loss: 0.247690
Epoch 40, train loss: 0.231742
Epoch 41, train loss: 0.218038
Epoch 42, train loss: 0.206975
Epoch 43, train loss: 0.198374
Epoch 44, train loss: 0.191725
Epoch 45, train loss: 0.186724
Epoch 46, train loss: 0.182857
Epoch 47, train loss: 0.180079
Epoch 48, train loss: 0.177839
Epoch 49, train loss: 0.176199
Epoch 50, train loss: 0.174842
Epoch 51, train loss: 0.173806
Epoch 52, train loss: 0.172872
Epoch 53, train loss: 0.172130
Epoch 54, train loss: 0.171478
Epoch 55, train loss: 0.170884
Epoch 56, train loss: 0.170391
Epoch 57, train loss: 0.169879
Epoch 58, train loss: 0.169578
Epoch 59, train loss: 0.169307
Epoch 60, train loss: 0.169125
Epoch 61, train loss: 0.168945
Epoch 62, train loss: 0.168910
Epoch 63, train loss: 0.168853
Epoch 64, train loss: 0.168909
Epoch 65, train loss: 0.169032
Epoch 66, train loss: 0.169032
Epoch 67, train loss: 0.169377
Epoch 68, train loss: 0.169433
Epoch 69, train loss: 0.169524
Epoch 70, train loss: 0.169769
Epoch 71, train loss: 0.170056
Epoch 72, train loss: 0.170195
Epoch 73, train loss: 0.170234
Epoch 74, train loss: 0.170550
Epoch 75, train loss: 0.170552
Epoch 76, train loss: 0.170654
Epoch 77, train loss: 0.170635
Epoch 78, train loss: 0.170505
Epoch 79, train loss: 0.170486
Epoch 80, train loss: 0.170320
Epoch 81, train loss: 0.170153
Epoch 82, train loss: 0.169913
Epoch 83, train loss: 0.169829
Epoch 84, train loss: 0.169471
Epoch 85, train loss: 0.168830
Epoch 86, train loss: 0.168484
Epoch 87, train loss: 0.167642
Epoch 88, train loss: 0.167538
Epoch 89, train loss: 0.166975
Epoch 90, train loss: 0.166601
Epoch 91, train loss: 0.165936
Epoch 92, train loss: 0.165260
Epoch 93, train loss: 0.164895
Epoch 94, train loss: 0.164293
Epoch 95, train loss: 0.163647
Epoch 96, train loss: 0.163156
Epoch 97, train loss: 0.162578
Epoch 98, train loss: 0.161999
Epoch 99, train loss: 0.161564
Test loss: 0.088118
Epoch 36, train loss: 0.324567
Epoch 37, train loss: 0.296717
Epoch 38, train loss: 0.272953
Epoch 39, train loss: 0.252290
Epoch 40, train loss: 0.234810
Epoch 41, train loss: 0.220023
Epoch 42, train loss: 0.208372
Epoch 43, train loss: 0.198931
Epoch 44, train loss: 0.191578
Epoch 45, train loss: 0.185920
Epoch 46, train loss: 0.181773
Epoch 47, train loss: 0.178710
Epoch 48, train loss: 0.176387
Epoch 49, train loss: 0.174716
Epoch 50, train loss: 0.173445
Epoch 51, train loss: 0.172408
Epoch 52, train loss: 0.171608
Epoch 53, train loss: 0.170910
Epoch 54, train loss: 0.170284
Epoch 55, train loss: 0.169780
Epoch 56, train loss: 0.169316
Epoch 57, train loss: 0.168953
Epoch 58, train loss: 0.168579
Epoch 59, train loss: 0.168229
Epoch 60, train loss: 0.167969
Epoch 61, train loss: 0.167740
Epoch 62, train loss: 0.167579
Epoch 63, train loss: 0.167505
Epoch 64, train loss: 0.167329
Epoch 65, train loss: 0.167323
Epoch 66, train loss: 0.167132
Epoch 67, train loss: 0.167127
Epoch 68, train loss: 0.167275
Epoch 69, train loss: 0.167206
Epoch 70, train loss: 0.167087
Epoch 71, train loss: 0.166992
Epoch 72, train loss: 0.167069
Epoch 73, train loss: 0.167000
Epoch 74, train loss: 0.166890
Epoch 75, train loss: 0.166709
Epoch 76, train loss: 0.166572
Epoch 77, train loss: 0.166199
Epoch 78, train loss: 0.165916
Epoch 79, train loss: 0.165874
Epoch 80, train loss: 0.165654
Epoch 81, train loss: 0.165180
Epoch 82, train loss: 0.164586
Epoch 83, train loss: 0.164045
Epoch 84, train loss: 0.163170
Epoch 85, train loss: 0.162310
Epoch 86, train loss: 0.161364
Epoch 87, train loss: 0.160418
Epoch 88, train loss: 0.159527
Epoch 89, train loss: 0.158392
Epoch 90, train loss: 0.157249
Epoch 91, train loss: 0.156257
Epoch 92, train loss: 0.155201
Epoch 93, train loss: 0.154257
Epoch 94, train loss: 0.153454
Epoch 95, train loss: 0.152286
Epoch 96, train loss: 0.151289
Epoch 97, train loss: 0.150182
Epoch 98, train loss: 0.149439
Epoch 99, train loss: 0.148586
Test loss: 0.131478
Epoch 36, train loss: 0.328266
Epoch 37, train loss: 0.300846
Epoch 38, train loss: 0.276333
Epoch 39, train loss: 0.255414
Epoch 40, train loss: 0.237929
Epoch 41, train loss: 0.222858
Epoch 42, train loss: 0.210318
Epoch 43, train loss: 0.200326
Epoch 44, train loss: 0.192536
Epoch 45, train loss: 0.186656
Epoch 46, train loss: 0.182257
Epoch 47, train loss: 0.179069
Epoch 48, train loss: 0.176747
Epoch 49, train loss: 0.174866
Epoch 50, train loss: 0.173417
Epoch 51, train loss: 0.172281
Epoch 52, train loss: 0.171347
Epoch 53, train loss: 0.170475
Epoch 54, train loss: 0.169782
Epoch 55, train loss: 0.169118
Epoch 56, train loss: 0.168533
Epoch 57, train loss: 0.168017
Epoch 58, train loss: 0.167670
Epoch 59, train loss: 0.167198
Epoch 60, train loss: 0.166949
Epoch 61, train loss: 0.166771
Epoch 62, train loss: 0.166588
Epoch 63, train loss: 0.166501
Epoch 64, train loss: 0.166434
Epoch 65, train loss: 0.166233
Epoch 66, train loss: 0.166270
Epoch 67, train loss: 0.166205
Epoch 68, train loss: 0.166186
Epoch 69, train loss: 0.166170
Epoch 70, train loss: 0.166019
Epoch 71, train loss: 0.165905
Epoch 72, train loss: 0.165750
Epoch 73, train loss: 0.165560
Epoch 74, train loss: 0.165333
Epoch 75, train loss: 0.164979
Epoch 76, train loss: 0.164527
Epoch 77, train loss: 0.164320
Epoch 78, train loss: 0.163748
Epoch 79, train loss: 0.163430
Epoch 80, train loss: 0.162879
Epoch 81, train loss: 0.162568
Epoch 82, train loss: 0.162048
Epoch 83, train loss: 0.161263
Epoch 84, train loss: 0.160732
Epoch 85, train loss: 0.160174
Epoch 86, train loss: 0.159534
Epoch 87, train loss: 0.158410
Epoch 88, train loss: 0.157482
Epoch 89, train loss: 0.156361
Epoch 90, train loss: 0.155530
Epoch 91, train loss: 0.154380
Epoch 92, train loss: 0.153466
Epoch 93, train loss: 0.152449
Epoch 94, train loss: 0.151287
Epoch 95, train loss: 0.150221
Epoch 96, train loss: 0.149259
Epoch 97, train loss: 0.148363
Epoch 98, train loss: 0.147633
Epoch 99, train loss: 0.146337
Test loss: 0.156803
Epoch 36, train loss: 0.356390
Epoch 37, train loss: 0.327691
Epoch 38, train loss: 0.301744
Epoch 39, train loss: 0.278889
Epoch 40, train loss: 0.258683
Epoch 41, train loss: 0.241847
Epoch 42, train loss: 0.226892
Epoch 43, train loss: 0.215165
Epoch 44, train loss: 0.205277
Epoch 45, train loss: 0.197139
Epoch 46, train loss: 0.190773
Epoch 47, train loss: 0.185900
Epoch 48, train loss: 0.182131
Epoch 49, train loss: 0.179349
Epoch 50, train loss: 0.177189
Epoch 51, train loss: 0.175564
Epoch 52, train loss: 0.174340
Epoch 53, train loss: 0.173388
Epoch 54, train loss: 0.172603
Epoch 55, train loss: 0.171980
Epoch 56, train loss: 0.171473
Epoch 57, train loss: 0.170957
Epoch 58, train loss: 0.170545
Epoch 59, train loss: 0.170134
Epoch 60, train loss: 0.169829
Epoch 61, train loss: 0.169504
Epoch 62, train loss: 0.169308
Epoch 63, train loss: 0.169051
Epoch 64, train loss: 0.168892
Epoch 65, train loss: 0.168660
Epoch 66, train loss: 0.168603
Epoch 67, train loss: 0.168642
Epoch 68, train loss: 0.168507
Epoch 69, train loss: 0.168526
Epoch 70, train loss: 0.168487
Epoch 71, train loss: 0.168391
Epoch 72, train loss: 0.168382
Epoch 73, train loss: 0.168386
Epoch 74, train loss: 0.168332
Epoch 75, train loss: 0.168313
Epoch 76, train loss: 0.168114
Epoch 77, train loss: 0.168103
Epoch 78, train loss: 0.168217
Epoch 79, train loss: 0.167836
Epoch 80, train loss: 0.167627
Epoch 81, train loss: 0.167476
Epoch 82, train loss: 0.167216
Epoch 83, train loss: 0.167074
Epoch 84, train loss: 0.166629
Epoch 85, train loss: 0.165987
Epoch 86, train loss: 0.165625
Epoch 87, train loss: 0.165168
Epoch 88, train loss: 0.164525
Epoch 89, train loss: 0.164183
Epoch 90, train loss: 0.163699
Epoch 91, train loss: 0.162975
Epoch 92, train loss: 0.162380
Epoch 93, train loss: 0.161826
Epoch 94, train loss: 0.161140
Epoch 95, train loss: 0.160485
Epoch 96, train loss: 0.159577
Epoch 97, train loss: 0.158842
Epoch 98, train loss: 0.157963
Epoch 99, train loss: 0.157262
Test loss: 0.098077
Epoch 36, train loss: 0.294957
Epoch 37, train loss: 0.271949
Epoch 38, train loss: 0.252121
Epoch 39, train loss: 0.235249
Epoch 40, train loss: 0.221106
Epoch 41, train loss: 0.209534
Epoch 42, train loss: 0.200251
Epoch 43, train loss: 0.192958
Epoch 44, train loss: 0.187372
Epoch 45, train loss: 0.183223
Epoch 46, train loss: 0.180165
Epoch 47, train loss: 0.177808
Epoch 48, train loss: 0.176139
Epoch 49, train loss: 0.174771
Epoch 50, train loss: 0.173752
Epoch 51, train loss: 0.172937
Epoch 52, train loss: 0.172284
Epoch 53, train loss: 0.171723
Epoch 54, train loss: 0.171250
Epoch 55, train loss: 0.170869
Epoch 56, train loss: 0.170482
Epoch 57, train loss: 0.170220
Epoch 58, train loss: 0.170017
Epoch 59, train loss: 0.169822
Epoch 60, train loss: 0.169690
Epoch 61, train loss: 0.169617
Epoch 62, train loss: 0.169720
Epoch 63, train loss: 0.169646
Epoch 64, train loss: 0.169781
Epoch 65, train loss: 0.169789
Epoch 66, train loss: 0.170013
Epoch 67, train loss: 0.170200
Epoch 68, train loss: 0.170376
Epoch 69, train loss: 0.170516
Epoch 70, train loss: 0.170851
Epoch 71, train loss: 0.170822
Epoch 72, train loss: 0.170934
Epoch 73, train loss: 0.171047
Epoch 74, train loss: 0.171238
Epoch 75, train loss: 0.171418
Epoch 76, train loss: 0.171291
Epoch 77, train loss: 0.171188
Epoch 78, train loss: 0.171040
Epoch 79, train loss: 0.170948
Epoch 80, train loss: 0.170605
Epoch 81, train loss: 0.170372
Epoch 82, train loss: 0.169933
Epoch 83, train loss: 0.169525
Epoch 84, train loss: 0.168961
Epoch 85, train loss: 0.168594
Epoch 86, train loss: 0.168268
Epoch 87, train loss: 0.167474
Epoch 88, train loss: 0.166900
Epoch 89, train loss: 0.166459
Epoch 90, train loss: 0.166057
Epoch 91, train loss: 0.165569
Epoch 92, train loss: 0.164821
Epoch 93, train loss: 0.163872
Epoch 94, train loss: 0.163544
Epoch 95, train loss: 0.162814
Epoch 96, train loss: 0.162126
Epoch 97, train loss: 0.161468
Epoch 98, train loss: 0.160785
Epoch 99, train loss: 0.159920
Test loss: 0.105743
Epoch 36, train loss: 0.281748
Epoch 37, train loss: 0.260239
Epoch 38, train loss: 0.241894
Epoch 39, train loss: 0.226546
Epoch 40, train loss: 0.213658
Epoch 41, train loss: 0.203563
Epoch 42, train loss: 0.195443
Epoch 43, train loss: 0.188966
Epoch 44, train loss: 0.184000
Epoch 45, train loss: 0.180311
Epoch 46, train loss: 0.177604
Epoch 47, train loss: 0.175529
Epoch 48, train loss: 0.174003
Epoch 49, train loss: 0.172817
Epoch 50, train loss: 0.171895
Epoch 51, train loss: 0.171176
Epoch 52, train loss: 0.170558
Epoch 53, train loss: 0.170025
Epoch 54, train loss: 0.169622
Epoch 55, train loss: 0.169323
Epoch 56, train loss: 0.168926
Epoch 57, train loss: 0.168726
Epoch 58, train loss: 0.168518
Epoch 59, train loss: 0.168354
Epoch 60, train loss: 0.168257
Epoch 61, train loss: 0.168260
Epoch 62, train loss: 0.168203
Epoch 63, train loss: 0.168321
Epoch 64, train loss: 0.168312
Epoch 65, train loss: 0.168437
Epoch 66, train loss: 0.168542
Epoch 67, train loss: 0.168734
Epoch 68, train loss: 0.169012
Epoch 69, train loss: 0.169004
Epoch 70, train loss: 0.169163
Epoch 71, train loss: 0.169359
Epoch 72, train loss: 0.169416
Epoch 73, train loss: 0.169498
Epoch 74, train loss: 0.169692
Epoch 75, train loss: 0.169564
Epoch 76, train loss: 0.169928
Epoch 77, train loss: 0.169744
Epoch 78, train loss: 0.169438
Epoch 79, train loss: 0.169224
Epoch 80, train loss: 0.169144
Epoch 81, train loss: 0.169048
Epoch 82, train loss: 0.168517
Epoch 83, train loss: 0.168225
Epoch 84, train loss: 0.167893
Epoch 85, train loss: 0.167133
Epoch 86, train loss: 0.166545
Epoch 87, train loss: 0.165776
Epoch 88, train loss: 0.164856
Epoch 89, train loss: 0.164137
Epoch 90, train loss: 0.163399
Epoch 91, train loss: 0.162646
Epoch 92, train loss: 0.161976
Epoch 93, train loss: 0.160991
Epoch 94, train loss: 0.160355
Epoch 95, train loss: 0.159415
Epoch 96, train loss: 0.158634
Epoch 97, train loss: 0.158022
Epoch 98, train loss: 0.157473
Epoch 99, train loss: 0.156637
Test loss: 0.139715
Epoch 36, train loss: 0.283853
Epoch 37, train loss: 0.261189
Epoch 38, train loss: 0.241856
Epoch 39, train loss: 0.225834
Epoch 40, train loss: 0.212577
Epoch 41, train loss: 0.202003
Epoch 42, train loss: 0.194041
Epoch 43, train loss: 0.187820
Epoch 44, train loss: 0.183337
Epoch 45, train loss: 0.179850
Epoch 46, train loss: 0.177377
Epoch 47, train loss: 0.175510
Epoch 48, train loss: 0.174072
Epoch 49, train loss: 0.172995
Epoch 50, train loss: 0.172142
Epoch 51, train loss: 0.171405
Epoch 52, train loss: 0.170809
Epoch 53, train loss: 0.170308
Epoch 54, train loss: 0.169815
Epoch 55, train loss: 0.169513
Epoch 56, train loss: 0.169240
Epoch 57, train loss: 0.168940
Epoch 58, train loss: 0.168734
Epoch 59, train loss: 0.168588
Epoch 60, train loss: 0.168491
Epoch 61, train loss: 0.168538
Epoch 62, train loss: 0.168582
Epoch 63, train loss: 0.168680
Epoch 64, train loss: 0.168791
Epoch 65, train loss: 0.168876
Epoch 66, train loss: 0.169047
Epoch 67, train loss: 0.169258
Epoch 68, train loss: 0.169595
Epoch 69, train loss: 0.169688
Epoch 70, train loss: 0.169726
Epoch 71, train loss: 0.170115
Epoch 72, train loss: 0.170336
Epoch 73, train loss: 0.170363
Epoch 74, train loss: 0.170466
Epoch 75, train loss: 0.170432
Epoch 76, train loss: 0.170440
Epoch 77, train loss: 0.170308
Epoch 78, train loss: 0.170129
Epoch 79, train loss: 0.169666
Epoch 80, train loss: 0.169777
Epoch 81, train loss: 0.169137
Epoch 82, train loss: 0.168832
Epoch 83, train loss: 0.168352
Epoch 84, train loss: 0.167987
Epoch 85, train loss: 0.167292
Epoch 86, train loss: 0.167007
Epoch 87, train loss: 0.166108
Epoch 88, train loss: 0.165340
Epoch 89, train loss: 0.164588
Epoch 90, train loss: 0.163983
Epoch 91, train loss: 0.163524
Epoch 92, train loss: 0.162848
Epoch 93, train loss: 0.162174
Epoch 94, train loss: 0.161557
Epoch 95, train loss: 0.160329
Epoch 96, train loss: 0.160019
Epoch 97, train loss: 0.159296
Epoch 98, train loss: 0.158446
Epoch 99, train loss: 0.157856
Test loss: 0.152647
Epoch 36, train loss: 0.317605
Epoch 37, train loss: 0.291307
Epoch 38, train loss: 0.268330
Epoch 39, train loss: 0.248876
Epoch 40, train loss: 0.232555
Epoch 41, train loss: 0.218964
Epoch 42, train loss: 0.207627
Epoch 43, train loss: 0.198629
Epoch 44, train loss: 0.191667
Epoch 45, train loss: 0.186199
Epoch 46, train loss: 0.182040
Epoch 47, train loss: 0.178857
Epoch 48, train loss: 0.176437
Epoch 49, train loss: 0.174739
Epoch 50, train loss: 0.173416
Epoch 51, train loss: 0.172421
Epoch 52, train loss: 0.171652
Epoch 53, train loss: 0.171039
Epoch 54, train loss: 0.170501
Epoch 55, train loss: 0.170077
Epoch 56, train loss: 0.169684
Epoch 57, train loss: 0.169501
Epoch 58, train loss: 0.169253
Epoch 59, train loss: 0.169048
Epoch 60, train loss: 0.168866
Epoch 61, train loss: 0.168831
Epoch 62, train loss: 0.168828
Epoch 63, train loss: 0.168709
Epoch 64, train loss: 0.168810
Epoch 65, train loss: 0.169007
Epoch 66, train loss: 0.169002
Epoch 67, train loss: 0.169133
Epoch 68, train loss: 0.169222
Epoch 69, train loss: 0.169410
Epoch 70, train loss: 0.169692
Epoch 71, train loss: 0.169914
Epoch 72, train loss: 0.169968
Epoch 73, train loss: 0.170161
Epoch 74, train loss: 0.170191
Epoch 75, train loss: 0.170350
Epoch 76, train loss: 0.170773
Epoch 77, train loss: 0.170449
Epoch 78, train loss: 0.170581
Epoch 79, train loss: 0.170545
Epoch 80, train loss: 0.170302
Epoch 81, train loss: 0.170227
Epoch 82, train loss: 0.170101
Epoch 83, train loss: 0.169597
Epoch 84, train loss: 0.169337
Epoch 85, train loss: 0.168886
Epoch 86, train loss: 0.168706
Epoch 87, train loss: 0.168181
Epoch 88, train loss: 0.167345
Epoch 89, train loss: 0.166646
Epoch 90, train loss: 0.166209
Epoch 91, train loss: 0.165240
Epoch 92, train loss: 0.164227
Epoch 93, train loss: 0.163235
Epoch 94, train loss: 0.162380
Epoch 95, train loss: 0.161660
Epoch 96, train loss: 0.160767
Epoch 97, train loss: 0.160191
Epoch 98, train loss: 0.159111
Epoch 99, train loss: 0.158153
Test loss: 0.097210
Epoch 36, train loss: 0.329082
Epoch 37, train loss: 0.302380
Epoch 38, train loss: 0.279473
Epoch 39, train loss: 0.259265
Epoch 40, train loss: 0.241820
Epoch 41, train loss: 0.227495
Epoch 42, train loss: 0.215281
Epoch 43, train loss: 0.205221
Epoch 44, train loss: 0.197132
Epoch 45, train loss: 0.190833
Epoch 46, train loss: 0.185889
Epoch 47, train loss: 0.182071
Epoch 48, train loss: 0.179361
Epoch 49, train loss: 0.177158
Epoch 50, train loss: 0.175490
Epoch 51, train loss: 0.174184
Epoch 52, train loss: 0.173146
Epoch 53, train loss: 0.172270
Epoch 54, train loss: 0.171491
Epoch 55, train loss: 0.170869
Epoch 56, train loss: 0.170306
Epoch 57, train loss: 0.169833
Epoch 58, train loss: 0.169409
Epoch 59, train loss: 0.169117
Epoch 60, train loss: 0.168797
Epoch 61, train loss: 0.168780
Epoch 62, train loss: 0.168694
Epoch 63, train loss: 0.168861
Epoch 64, train loss: 0.168829
Epoch 65, train loss: 0.168999
Epoch 66, train loss: 0.169182
Epoch 67, train loss: 0.169524
Epoch 68, train loss: 0.169745
Epoch 69, train loss: 0.169962
Epoch 70, train loss: 0.170356
Epoch 71, train loss: 0.170587
Epoch 72, train loss: 0.170621
Epoch 73, train loss: 0.170860
Epoch 74, train loss: 0.171203
Epoch 75, train loss: 0.171339
Epoch 76, train loss: 0.171465
Epoch 77, train loss: 0.171222
Epoch 78, train loss: 0.171212
Epoch 79, train loss: 0.171348
Epoch 80, train loss: 0.171290
Epoch 81, train loss: 0.170651
Epoch 82, train loss: 0.170222
Epoch 83, train loss: 0.169801
Epoch 84, train loss: 0.169558
Epoch 85, train loss: 0.168988
Epoch 86, train loss: 0.168491
Epoch 87, train loss: 0.167840
Epoch 88, train loss: 0.167366
Epoch 89, train loss: 0.166380
Epoch 90, train loss: 0.165537
Epoch 91, train loss: 0.164578
Epoch 92, train loss: 0.163828
Epoch 93, train loss: 0.162819
Epoch 94, train loss: 0.161908
Epoch 95, train loss: 0.160838
Epoch 96, train loss: 0.160002
Epoch 97, train loss: 0.158942
Epoch 98, train loss: 0.158001
Epoch 99, train loss: 0.157302
Test loss: 0.106548
Epoch 36, train loss: 0.346587
Epoch 37, train loss: 0.318635
Epoch 38, train loss: 0.293804
Epoch 39, train loss: 0.271900
Epoch 40, train loss: 0.252634
Epoch 41, train loss: 0.236368
Epoch 42, train loss: 0.222794
Epoch 43, train loss: 0.211333
Epoch 44, train loss: 0.202019
Epoch 45, train loss: 0.194595
Epoch 46, train loss: 0.188666
Epoch 47, train loss: 0.184068
Epoch 48, train loss: 0.180551
Epoch 49, train loss: 0.177790
Epoch 50, train loss: 0.175694
Epoch 51, train loss: 0.174128
Epoch 52, train loss: 0.172886
Epoch 53, train loss: 0.171941
Epoch 54, train loss: 0.171183
Epoch 55, train loss: 0.170621
Epoch 56, train loss: 0.170129
Epoch 57, train loss: 0.169757
Epoch 58, train loss: 0.169462
Epoch 59, train loss: 0.169215
Epoch 60, train loss: 0.168995
Epoch 61, train loss: 0.168801
Epoch 62, train loss: 0.168762
Epoch 63, train loss: 0.168797
Epoch 64, train loss: 0.168748
Epoch 65, train loss: 0.168910
Epoch 66, train loss: 0.168977
Epoch 67, train loss: 0.169005
Epoch 68, train loss: 0.169141
Epoch 69, train loss: 0.169450
Epoch 70, train loss: 0.169759
Epoch 71, train loss: 0.169963
Epoch 72, train loss: 0.170066
Epoch 73, train loss: 0.170369
Epoch 74, train loss: 0.170839
Epoch 75, train loss: 0.171016
Epoch 76, train loss: 0.171276
Epoch 77, train loss: 0.171521
Epoch 78, train loss: 0.171840
Epoch 79, train loss: 0.172096
Epoch 80, train loss: 0.172104
Epoch 81, train loss: 0.172011
Epoch 82, train loss: 0.171494
Epoch 83, train loss: 0.171188
Epoch 84, train loss: 0.170702
Epoch 85, train loss: 0.170074
Epoch 86, train loss: 0.169259
Epoch 87, train loss: 0.168330
Epoch 88, train loss: 0.167671
Epoch 89, train loss: 0.166486
Epoch 90, train loss: 0.165287
Epoch 91, train loss: 0.164241
Epoch 92, train loss: 0.163195
Epoch 93, train loss: 0.162054
Epoch 94, train loss: 0.160909
Epoch 95, train loss: 0.159662
Epoch 96, train loss: 0.158484
Epoch 97, train loss: 0.157275
Epoch 98, train loss: 0.156308
Epoch 99, train loss: 0.155126
Test loss: 0.170024
Epoch 36, train loss: 0.332762
Epoch 37, train loss: 0.306608
Epoch 38, train loss: 0.282797
Epoch 39, train loss: 0.262377
Epoch 40, train loss: 0.244645
Epoch 41, train loss: 0.229351
Epoch 42, train loss: 0.216722
Epoch 43, train loss: 0.206303
Epoch 44, train loss: 0.197897
Epoch 45, train loss: 0.191502
Epoch 46, train loss: 0.186251
Epoch 47, train loss: 0.182262
Epoch 48, train loss: 0.179257
Epoch 49, train loss: 0.177071
Epoch 50, train loss: 0.175393
Epoch 51, train loss: 0.174113
Epoch 52, train loss: 0.173122
Epoch 53, train loss: 0.172327
Epoch 54, train loss: 0.171615
Epoch 55, train loss: 0.171033
Epoch 56, train loss: 0.170521
Epoch 57, train loss: 0.170011
Epoch 58, train loss: 0.169654
Epoch 59, train loss: 0.169302
Epoch 60, train loss: 0.168920
Epoch 61, train loss: 0.168599
Epoch 62, train loss: 0.168332
Epoch 63, train loss: 0.168194
Epoch 64, train loss: 0.168006
Epoch 65, train loss: 0.167922
Epoch 66, train loss: 0.167880
Epoch 67, train loss: 0.167820
Epoch 68, train loss: 0.167756
Epoch 69, train loss: 0.167656
Epoch 70, train loss: 0.167572
Epoch 71, train loss: 0.167632
Epoch 72, train loss: 0.167648
Epoch 73, train loss: 0.167794
Epoch 74, train loss: 0.167650
Epoch 75, train loss: 0.167454
Epoch 76, train loss: 0.167392
Epoch 77, train loss: 0.167235
Epoch 78, train loss: 0.167142
Epoch 79, train loss: 0.167190
Epoch 80, train loss: 0.166744
Epoch 81, train loss: 0.166480
Epoch 82, train loss: 0.166105
Epoch 83, train loss: 0.165734
Epoch 84, train loss: 0.165127
Epoch 85, train loss: 0.164544
Epoch 86, train loss: 0.163835
Epoch 87, train loss: 0.163188
Epoch 88, train loss: 0.162283
Epoch 89, train loss: 0.161731
Epoch 90, train loss: 0.160818
Epoch 91, train loss: 0.160159
Epoch 92, train loss: 0.159155
Epoch 93, train loss: 0.158434
Epoch 94, train loss: 0.157656
Epoch 95, train loss: 0.156755
Epoch 96, train loss: 0.156127
Epoch 97, train loss: 0.155431
Epoch 98, train loss: 0.154367
Epoch 99, train loss: 0.153647
Test loss: 0.132933
Epoch 36, train loss: 0.275242
Epoch 37, train loss: 0.254839
Epoch 38, train loss: 0.237383
Epoch 39, train loss: 0.223273
Epoch 40, train loss: 0.211129
Epoch 41, train loss: 0.201768
Epoch 42, train loss: 0.194222
Epoch 43, train loss: 0.188481
Epoch 44, train loss: 0.183934
Epoch 45, train loss: 0.180564
Epoch 46, train loss: 0.178086
Epoch 47, train loss: 0.176190
Epoch 48, train loss: 0.174719
Epoch 49, train loss: 0.173644
Epoch 50, train loss: 0.172787
Epoch 51, train loss: 0.172095
Epoch 52, train loss: 0.171491
Epoch 53, train loss: 0.171047
Epoch 54, train loss: 0.170600
Epoch 55, train loss: 0.170300
Epoch 56, train loss: 0.170010
Epoch 57, train loss: 0.169836
Epoch 58, train loss: 0.169606
Epoch 59, train loss: 0.169529
Epoch 60, train loss: 0.169442
Epoch 61, train loss: 0.169511
Epoch 62, train loss: 0.169562
Epoch 63, train loss: 0.169673
Epoch 64, train loss: 0.169806
Epoch 65, train loss: 0.169873
Epoch 66, train loss: 0.170109
Epoch 67, train loss: 0.170179
Epoch 68, train loss: 0.170510
Epoch 69, train loss: 0.170831
Epoch 70, train loss: 0.170784
Epoch 71, train loss: 0.170863
Epoch 72, train loss: 0.170974
Epoch 73, train loss: 0.170998
Epoch 74, train loss: 0.171058
Epoch 75, train loss: 0.171234
Epoch 76, train loss: 0.171345
Epoch 77, train loss: 0.171072
Epoch 78, train loss: 0.170888
Epoch 79, train loss: 0.170576
Epoch 80, train loss: 0.170364
Epoch 81, train loss: 0.170283
Epoch 82, train loss: 0.169937
Epoch 83, train loss: 0.169383
Epoch 84, train loss: 0.169014
Epoch 85, train loss: 0.168445
Epoch 86, train loss: 0.167630
Epoch 87, train loss: 0.167106
Epoch 88, train loss: 0.166851
Epoch 89, train loss: 0.166018
Epoch 90, train loss: 0.165147
Epoch 91, train loss: 0.164249
Epoch 92, train loss: 0.163357
Epoch 93, train loss: 0.162749
Epoch 94, train loss: 0.161800
Epoch 95, train loss: 0.160734
Epoch 96, train loss: 0.160107
Epoch 97, train loss: 0.159130
Epoch 98, train loss: 0.158164
Epoch 99, train loss: 0.157253
Test loss: 0.121222
Epoch 36, train loss: 0.294294
Epoch 37, train loss: 0.270288
Epoch 38, train loss: 0.249487
Epoch 39, train loss: 0.231916
Epoch 40, train loss: 0.217336
Epoch 41, train loss: 0.205586
Epoch 42, train loss: 0.196398
Epoch 43, train loss: 0.189442
Epoch 44, train loss: 0.184144
Epoch 45, train loss: 0.180241
Epoch 46, train loss: 0.177436
Epoch 47, train loss: 0.175362
Epoch 48, train loss: 0.173799
Epoch 49, train loss: 0.172593
Epoch 50, train loss: 0.171690
Epoch 51, train loss: 0.170954
Epoch 52, train loss: 0.170311
Epoch 53, train loss: 0.169788
Epoch 54, train loss: 0.169241
Epoch 55, train loss: 0.168792
Epoch 56, train loss: 0.168497
Epoch 57, train loss: 0.168215
Epoch 58, train loss: 0.167894
Epoch 59, train loss: 0.167714
Epoch 60, train loss: 0.167559
Epoch 61, train loss: 0.167467
Epoch 62, train loss: 0.167435
Epoch 63, train loss: 0.167269
Epoch 64, train loss: 0.167315
Epoch 65, train loss: 0.167409
Epoch 66, train loss: 0.167354
Epoch 67, train loss: 0.167401
Epoch 68, train loss: 0.167479
Epoch 69, train loss: 0.167473
Epoch 70, train loss: 0.167395
Epoch 71, train loss: 0.167227
Epoch 72, train loss: 0.167053
Epoch 73, train loss: 0.166878
Epoch 74, train loss: 0.166634
Epoch 75, train loss: 0.166365
Epoch 76, train loss: 0.165662
Epoch 77, train loss: 0.165129
Epoch 78, train loss: 0.164667
Epoch 79, train loss: 0.164243
Epoch 80, train loss: 0.163597
Epoch 81, train loss: 0.162782
Epoch 82, train loss: 0.162189
Epoch 83, train loss: 0.161497
Epoch 84, train loss: 0.161004
Epoch 85, train loss: 0.160095
Epoch 86, train loss: 0.159028
Epoch 87, train loss: 0.158413
Epoch 88, train loss: 0.157772
Epoch 89, train loss: 0.157016
Epoch 90, train loss: 0.156153
Epoch 91, train loss: 0.155400
Epoch 92, train loss: 0.154705
Epoch 93, train loss: 0.153963
Epoch 94, train loss: 0.153116
Epoch 95, train loss: 0.152344
Epoch 96, train loss: 0.151557
Epoch 97, train loss: 0.151148
Epoch 98, train loss: 0.150476
Epoch 99, train loss: 0.149518
Test loss: 0.140113
Epoch 36, train loss: 0.320082
Epoch 37, train loss: 0.293250
Epoch 38, train loss: 0.269181
Epoch 39, train loss: 0.249103
Epoch 40, train loss: 0.231888
Epoch 41, train loss: 0.217077
Epoch 42, train loss: 0.205361
Epoch 43, train loss: 0.196016
Epoch 44, train loss: 0.188856
Epoch 45, train loss: 0.183447
Epoch 46, train loss: 0.179312
Epoch 47, train loss: 0.176567
Epoch 48, train loss: 0.174550
Epoch 49, train loss: 0.172960
Epoch 50, train loss: 0.171869
Epoch 51, train loss: 0.170929
Epoch 52, train loss: 0.170175
Epoch 53, train loss: 0.169531
Epoch 54, train loss: 0.168939
Epoch 55, train loss: 0.168353
Epoch 56, train loss: 0.167990
Epoch 57, train loss: 0.167529
Epoch 58, train loss: 0.167059
Epoch 59, train loss: 0.166757
Epoch 60, train loss: 0.166509
Epoch 61, train loss: 0.166172
Epoch 62, train loss: 0.165998
Epoch 63, train loss: 0.165860
Epoch 64, train loss: 0.165616
Epoch 65, train loss: 0.165421
Epoch 66, train loss: 0.165202
Epoch 67, train loss: 0.165125
Epoch 68, train loss: 0.164842
Epoch 69, train loss: 0.164597
Epoch 70, train loss: 0.164304
Epoch 71, train loss: 0.164172
Epoch 72, train loss: 0.163774
Epoch 73, train loss: 0.163563
Epoch 74, train loss: 0.162941
Epoch 75, train loss: 0.162477
Epoch 76, train loss: 0.161878
Epoch 77, train loss: 0.161370
Epoch 78, train loss: 0.160802
Epoch 79, train loss: 0.159980
Epoch 80, train loss: 0.159246
Epoch 81, train loss: 0.158498
Epoch 82, train loss: 0.157666
Epoch 83, train loss: 0.156776
Epoch 84, train loss: 0.155828
Epoch 85, train loss: 0.155109
Epoch 86, train loss: 0.154294
Epoch 87, train loss: 0.153425
Epoch 88, train loss: 0.152439
Epoch 89, train loss: 0.151771
Epoch 90, train loss: 0.150931
Epoch 91, train loss: 0.149979
Epoch 92, train loss: 0.149196
Epoch 93, train loss: 0.148342
Epoch 94, train loss: 0.147546
Epoch 95, train loss: 0.146526
Epoch 96, train loss: 0.145586
Epoch 97, train loss: 0.144741
Epoch 98, train loss: 0.143879
Epoch 99, train loss: 0.142987
Test loss: 0.199252
Epoch 36, train loss: 0.283046
Epoch 37, train loss: 0.260617
Epoch 38, train loss: 0.241674
Epoch 39, train loss: 0.226252
Epoch 40, train loss: 0.213583
Epoch 41, train loss: 0.203147
Epoch 42, train loss: 0.195302
Epoch 43, train loss: 0.189049
Epoch 44, train loss: 0.184403
Epoch 45, train loss: 0.180810
Epoch 46, train loss: 0.178227
Epoch 47, train loss: 0.176249
Epoch 48, train loss: 0.174695
Epoch 49, train loss: 0.173554
Epoch 50, train loss: 0.172696
Epoch 51, train loss: 0.172012
Epoch 52, train loss: 0.171453
Epoch 53, train loss: 0.170988
Epoch 54, train loss: 0.170548
Epoch 55, train loss: 0.170271
Epoch 56, train loss: 0.169988
Epoch 57, train loss: 0.169833
Epoch 58, train loss: 0.169565
Epoch 59, train loss: 0.169525
Epoch 60, train loss: 0.169496
Epoch 61, train loss: 0.169471
Epoch 62, train loss: 0.169507
Epoch 63, train loss: 0.169671
Epoch 64, train loss: 0.169894
Epoch 65, train loss: 0.169911
Epoch 66, train loss: 0.170091
Epoch 67, train loss: 0.170347
Epoch 68, train loss: 0.170616
Epoch 69, train loss: 0.170740
Epoch 70, train loss: 0.171121
Epoch 71, train loss: 0.171162
Epoch 72, train loss: 0.171122
Epoch 73, train loss: 0.171322
Epoch 74, train loss: 0.171482
Epoch 75, train loss: 0.171209
Epoch 76, train loss: 0.171053
Epoch 77, train loss: 0.170773
Epoch 78, train loss: 0.170470
Epoch 79, train loss: 0.170119
Epoch 80, train loss: 0.169675
Epoch 81, train loss: 0.169327
Epoch 82, train loss: 0.168819
Epoch 83, train loss: 0.168304
Epoch 84, train loss: 0.167467
Epoch 85, train loss: 0.167040
Epoch 86, train loss: 0.165932
Epoch 87, train loss: 0.164939
Epoch 88, train loss: 0.163992
Epoch 89, train loss: 0.163264
Epoch 90, train loss: 0.162797
Epoch 91, train loss: 0.161943
Epoch 92, train loss: 0.160873
Epoch 93, train loss: 0.159997
Epoch 94, train loss: 0.159034
Epoch 95, train loss: 0.158057
Epoch 96, train loss: 0.157679
Epoch 97, train loss: 0.156874
Epoch 98, train loss: 0.155857
Epoch 99, train loss: 0.155228
Test loss: 0.230139
Epoch 36, train loss: 0.251098
Epoch 37, train loss: 0.233876
Epoch 38, train loss: 0.219818
Epoch 39, train loss: 0.208273
Epoch 40, train loss: 0.199103
Epoch 41, train loss: 0.191946
Epoch 42, train loss: 0.186482
Epoch 43, train loss: 0.182287
Epoch 44, train loss: 0.179077
Epoch 45, train loss: 0.176718
Epoch 46, train loss: 0.174891
Epoch 47, train loss: 0.173511
Epoch 48, train loss: 0.172527
Epoch 49, train loss: 0.171762
Epoch 50, train loss: 0.171136
Epoch 51, train loss: 0.170622
Epoch 52, train loss: 0.170201
Epoch 53, train loss: 0.169808
Epoch 54, train loss: 0.169544
Epoch 55, train loss: 0.169286
Epoch 56, train loss: 0.169229
Epoch 57, train loss: 0.169035
Epoch 58, train loss: 0.169016
Epoch 59, train loss: 0.168951
Epoch 60, train loss: 0.168998
Epoch 61, train loss: 0.169143
Epoch 62, train loss: 0.169271
Epoch 63, train loss: 0.169449
Epoch 64, train loss: 0.169591
Epoch 65, train loss: 0.169952
Epoch 66, train loss: 0.170000
Epoch 67, train loss: 0.170179
Epoch 68, train loss: 0.170378
Epoch 69, train loss: 0.170639
Epoch 70, train loss: 0.170885
Epoch 71, train loss: 0.170829
Epoch 72, train loss: 0.171050
Epoch 73, train loss: 0.170925
Epoch 74, train loss: 0.171054
Epoch 75, train loss: 0.171017
Epoch 76, train loss: 0.170377
Epoch 77, train loss: 0.170200
Epoch 78, train loss: 0.169568
Epoch 79, train loss: 0.169221
Epoch 80, train loss: 0.168416
Epoch 81, train loss: 0.167860
Epoch 82, train loss: 0.166924
Epoch 83, train loss: 0.166526
Epoch 84, train loss: 0.165496
Epoch 85, train loss: 0.164662
Epoch 86, train loss: 0.163928
Epoch 87, train loss: 0.163204
Epoch 88, train loss: 0.161963
Epoch 89, train loss: 0.161143
Epoch 90, train loss: 0.160043
Epoch 91, train loss: 0.159339
Epoch 92, train loss: 0.158156
Epoch 93, train loss: 0.157108
Epoch 94, train loss: 0.156169
Epoch 95, train loss: 0.155211
Epoch 96, train loss: 0.154228
Epoch 97, train loss: 0.153714
Epoch 98, train loss: 0.152998
Epoch 99, train loss: 0.152148
Test loss: 0.126995
Epoch 36, train loss: 0.316846
Epoch 37, train loss: 0.291067
Epoch 38, train loss: 0.268304
Epoch 39, train loss: 0.249131
Epoch 40, train loss: 0.232604
Epoch 41, train loss: 0.219158
Epoch 42, train loss: 0.208133
Epoch 43, train loss: 0.199588
Epoch 44, train loss: 0.192226
Epoch 45, train loss: 0.187006
Epoch 46, train loss: 0.182718
Epoch 47, train loss: 0.179561
Epoch 48, train loss: 0.176980
Epoch 49, train loss: 0.175211
Epoch 50, train loss: 0.173865
Epoch 51, train loss: 0.172767
Epoch 52, train loss: 0.171946
Epoch 53, train loss: 0.171273
Epoch 54, train loss: 0.170704
Epoch 55, train loss: 0.170230
Epoch 56, train loss: 0.169817
Epoch 57, train loss: 0.169396
Epoch 58, train loss: 0.169112
Epoch 59, train loss: 0.168842
Epoch 60, train loss: 0.168573
Epoch 61, train loss: 0.168441
Epoch 62, train loss: 0.168274
Epoch 63, train loss: 0.168132
Epoch 64, train loss: 0.168063
Epoch 65, train loss: 0.168017
Epoch 66, train loss: 0.167992
Epoch 67, train loss: 0.167853
Epoch 68, train loss: 0.167881
Epoch 69, train loss: 0.167909
Epoch 70, train loss: 0.167900
Epoch 71, train loss: 0.167648
Epoch 72, train loss: 0.167484
Epoch 73, train loss: 0.167283
Epoch 74, train loss: 0.167155
Epoch 75, train loss: 0.166659
Epoch 76, train loss: 0.166454
Epoch 77, train loss: 0.166222
Epoch 78, train loss: 0.165525
Epoch 79, train loss: 0.165102
Epoch 80, train loss: 0.164744
Epoch 81, train loss: 0.164375
Epoch 82, train loss: 0.163608
Epoch 83, train loss: 0.162914
Epoch 84, train loss: 0.161932
Epoch 85, train loss: 0.161376
Epoch 86, train loss: 0.160754
Epoch 87, train loss: 0.159884
Epoch 88, train loss: 0.158790
Epoch 89, train loss: 0.157848
Epoch 90, train loss: 0.157015
Epoch 91, train loss: 0.156096
Epoch 92, train loss: 0.155086
Epoch 93, train loss: 0.153951
Epoch 94, train loss: 0.153131
Epoch 95, train loss: 0.152282
Epoch 96, train loss: 0.151148
Epoch 97, train loss: 0.150362
Epoch 98, train loss: 0.149626
Epoch 99, train loss: 0.148672
Test loss: 0.135725
Epoch 36, train loss: 0.298249
Epoch 37, train loss: 0.274904
Epoch 38, train loss: 0.254414
Epoch 39, train loss: 0.237091
Epoch 40, train loss: 0.222672
Epoch 41, train loss: 0.210963
Epoch 42, train loss: 0.201675
Epoch 43, train loss: 0.194239
Epoch 44, train loss: 0.188320
Epoch 45, train loss: 0.183862
Epoch 46, train loss: 0.180396
Epoch 47, train loss: 0.177766
Epoch 48, train loss: 0.175831
Epoch 49, train loss: 0.174371
Epoch 50, train loss: 0.173208
Epoch 51, train loss: 0.172292
Epoch 52, train loss: 0.171528
Epoch 53, train loss: 0.170931
Epoch 54, train loss: 0.170372
Epoch 55, train loss: 0.169860
Epoch 56, train loss: 0.169431
Epoch 57, train loss: 0.169098
Epoch 58, train loss: 0.168640
Epoch 59, train loss: 0.168377
Epoch 60, train loss: 0.168175
Epoch 61, train loss: 0.167934
Epoch 62, train loss: 0.167714
Epoch 63, train loss: 0.167610
Epoch 64, train loss: 0.167453
Epoch 65, train loss: 0.167588
Epoch 66, train loss: 0.167484
Epoch 67, train loss: 0.167415
Epoch 68, train loss: 0.167231
Epoch 69, train loss: 0.167210
Epoch 70, train loss: 0.167155
Epoch 71, train loss: 0.167094
Epoch 72, train loss: 0.167094
Epoch 73, train loss: 0.166981
Epoch 74, train loss: 0.166908
Epoch 75, train loss: 0.166577
Epoch 76, train loss: 0.166424
Epoch 77, train loss: 0.166001
Epoch 78, train loss: 0.165904
Epoch 79, train loss: 0.165673
Epoch 80, train loss: 0.165229
Epoch 81, train loss: 0.164963
Epoch 82, train loss: 0.164665
Epoch 83, train loss: 0.163869
Epoch 84, train loss: 0.163446
Epoch 85, train loss: 0.163232
Epoch 86, train loss: 0.162697
Epoch 87, train loss: 0.162034
Epoch 88, train loss: 0.161553
Epoch 89, train loss: 0.161064
Epoch 90, train loss: 0.160425
Epoch 91, train loss: 0.159810
Epoch 92, train loss: 0.158818
Epoch 93, train loss: 0.157876
Epoch 94, train loss: 0.157359
Epoch 95, train loss: 0.156626
Epoch 96, train loss: 0.155703
Epoch 97, train loss: 0.154973
Epoch 98, train loss: 0.153982
Epoch 99, train loss: 0.153110
Test loss: 0.120310
Epoch 36, train loss: 0.300487
Epoch 37, train loss: 0.275923
Epoch 38, train loss: 0.254575
Epoch 39, train loss: 0.236561
Epoch 40, train loss: 0.221287
Epoch 41, train loss: 0.209024
Epoch 42, train loss: 0.199025
Epoch 43, train loss: 0.191500
Epoch 44, train loss: 0.185683
Epoch 45, train loss: 0.181392
Epoch 46, train loss: 0.178108
Epoch 47, train loss: 0.175721
Epoch 48, train loss: 0.173878
Epoch 49, train loss: 0.172485
Epoch 50, train loss: 0.171423
Epoch 51, train loss: 0.170543
Epoch 52, train loss: 0.169804
Epoch 53, train loss: 0.169172
Epoch 54, train loss: 0.168643
Epoch 55, train loss: 0.168182
Epoch 56, train loss: 0.167732
Epoch 57, train loss: 0.167378
Epoch 58, train loss: 0.167064
Epoch 59, train loss: 0.166732
Epoch 60, train loss: 0.166625
Epoch 61, train loss: 0.166398
Epoch 62, train loss: 0.166235
Epoch 63, train loss: 0.166188
Epoch 64, train loss: 0.166083
Epoch 65, train loss: 0.165995
Epoch 66, train loss: 0.166053
Epoch 67, train loss: 0.166006
Epoch 68, train loss: 0.166067
Epoch 69, train loss: 0.165881
Epoch 70, train loss: 0.165780
Epoch 71, train loss: 0.165692
Epoch 72, train loss: 0.165527
Epoch 73, train loss: 0.165470
Epoch 74, train loss: 0.165338
Epoch 75, train loss: 0.165270
Epoch 76, train loss: 0.164954
Epoch 77, train loss: 0.164750
Epoch 78, train loss: 0.164363
Epoch 79, train loss: 0.163997
Epoch 80, train loss: 0.163781
Epoch 81, train loss: 0.163450
Epoch 82, train loss: 0.162972
Epoch 83, train loss: 0.162452
Epoch 84, train loss: 0.161788
Epoch 85, train loss: 0.161022
Epoch 86, train loss: 0.160048
Epoch 87, train loss: 0.159097
Epoch 88, train loss: 0.158337
Epoch 89, train loss: 0.157554
Epoch 90, train loss: 0.156543
Epoch 91, train loss: 0.155570
Epoch 92, train loss: 0.154876
Epoch 93, train loss: 0.153755
Epoch 94, train loss: 0.152859
Epoch 95, train loss: 0.151987
Epoch 96, train loss: 0.151304
Epoch 97, train loss: 0.150257
Epoch 98, train loss: 0.149417
Epoch 99, train loss: 0.148764
Test loss: 0.197296
50-fold validation: Avg train loss: 0.154830, Avg test loss: 0.160569
By looking at the average values,our stacked model is the best.

Conclusions and Remarks

  • The best model seems to be the stacked one.
  • Other ideas would be to following a bayesian inference approach and implementing a gaussian process which we will not do for the lack of time and since we have a large dataset and inverting it takes time.
  • Deepnets could produce better results but need a lot of tweaking and experimenting.
  • Getting a lower RMSE needs experimenting with various things. We have no perfect data not perfect data that can give us perfect results,we just have to try.
  • Some cells were not run again since they take too much time.

References


A big thanks goes to Prof. Pietro Michiardi and the teaching assistant for answering my questions and guiding me.

The following is al ist of references we used:
https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/
https://machinelearningmastery.com/an-introduction-to-feature-selection/
http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf
https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/
https://codingstartups.com/practical-machine-learning-ridge-regression-vs-lasso/
https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/https://codingstartups.com/practical-machine-learning-ridge-regression-vs-lasso/
https://www.kaggle.com/erikbruin/house-prices-lasso-xgboost-and-a-detailed-eda
https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard
http://blog.kaggle.com/2017/06/15/stacking-made-easy-an-introduction-to-stacknet-by-competitions-grandmaster-marios-michailidis-kazanova/
https://www.kaggle.com/agehsbarg/top-10-0-10943-stacking-mice-and-brutal-force
https://www.kaggle.com/humananalog/xgboost-lasso
https://www.kaggle.com/apapiu/regularized-linear-models
http://ww2.amstat.org/publications/jse/v19n3/Decock/DataDocumentation.txt
https://www.kaggle.com/harunshimanto/house-price-prediction